AUTOMATED  KNOWLEDGE  ACQUISITION 
VIA  INDUCTIVE  LEARNING  : 

AN  APPLICATION  TO 

A MARKETING  COMMUNICATIONS  EXPERT  SYSTEM 


By 

CHRYSANTHUS SHERVANTHIE DE ALMEIDA


A DISSERTATION  PRESENTED  TO 
THE  GRADUATE  SCHOOL  OF  THE  UNIVERSITY  OF  FLORIDA 
IN  PARTIAL  FULFILLMENT  OF  THE  REQUIREMENTS  FOR 
THE  DEGREE  OF  DOCTOR  OF  PHILOSOPHY 


UNIVERSITY  OF  FLORIDA 


1993 


DEDICATION 


To  my  wife  Chandrika 

and  children  Charnika,  Chrishnika,  and  Chrishan 


ACKNOWLEDGEMENTS 


I wish  to  thank  Dr.  Gary  Koehler,  the  chairman  of  my  committee,  and  Dr.  Chris 
Janiszewski,  the  external  member,  who  both  have  given  me  invaluable  support  and 
guidance  throughout  this  research.  Their  insightful  comments  and  suggestions  have 
contributed immensely to the successful completion of this dissertation. Their time and
wisdom, so willingly and abundantly offered, are greatly appreciated. I also wish to thank
Drs.  Richard  Elnicki,  Mark  Pendergast  (Ex-member),  and  Antal  Majthay  for  serving  on 
my  committee  and  for  their  help  and  advice. 

My  most  heartfelt  gratitude  is  due  to  my  wife  Chandrika,  and  children,  Charnika, 
Chrishnika, and Chrishan, who have all shared with me the agonies and ecstasies of all
my  endeavors,  and  without  whom  this  great  achievement  would  have  been  unthinkable. 
Their  ever-present  love,  affection,  and  understanding  were  a tremendous  source  of 
strength. 

I am  grateful  to  my  late  parents  who  have  blessed  me  with  life,  nurtured, 
nourished,  and  inspired  me  to  reach  out  for  great  achievements.  I am  indebted  to  my 
wife’s  family  for  the  strength  and  support  they  have  given  us,  and  especially,  to  my 
father-in-law, whose encouragement and advice have always stood me in good stead. I am also
thankful to my brother, sisters, and their families for all the help and good wishes they
have extended to us.


TABLE  OF  CONTENTS 


ACKNOWLEDGEMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 Background
1.2 Research Problem
1.3 Purpose
1.4 Motivation
1.5 Chapter Organization

2 LITERATURE REVIEW
2.1 Historical Overview
2.1.1 Directions of Artificial Intelligence Research
2.1.2 Early General Problem Solving Systems
2.1.3 Advent of Expert Systems
2.1.4 Learning: An Essential Characteristic of Intelligent Systems
2.1.5 Evolution of Learning Research
2.2 Inductive Learning
2.2.1 Learning Defined
2.2.2 Taxonomy of Inductive Learning Methods
2.2.3 Decision Trees
2.3 Classification Trees
2.3.1 Formal Concepts
2.3.2 Construction of Classification Trees
2.3.3 Issues Related to Classification Tree Construction
2.3.4 Performance Evaluation
2.3.5 Summary
2.4 Regression Trees
2.4.1 Introduction
2.4.2 Regression Tree Building Procedure
2.4.3 Prediction and Error Estimation
2.5 Business Applications of Tree Induction

3 DOMAIN PROBLEM
3.1 Overview of Catalog Sales
3.2 Problem Analysis
3.3 Problem Specification
3.4 Benefits

4 EXPERIMENTAL METHODOLOGY, RESULTS, AND ANALYSIS - PART I
4.1 Research Goals
4.2 Example Databases
4.3 Experimental Methodology
4.3.1 Learning Algorithm
4.3.2 Performance Measures
4.4 Preliminary Results and Analysis
4.4.1 Analysis One
4.4.2 Analysis Two
4.5 Detailed Analysis - Attention/Display Model
4.5.1 Predictive Performance
4.5.2 Attribute Significance and Tree Stability
4.5.3 Rule Abstracts

5 EXPERIMENTAL METHODOLOGY, RESULTS, AND ANALYSIS - PART II
5.1 Methodology
5.1.1 Measure of Predictive Performance
5.1.2 Experimental Method
5.2 Results and Analysis
5.2.1 Predictive Performance
5.2.2 Effect of Sample Size on Predictive Performance
5.2.3 Rule Abstracts

6 CONCLUSIONS AND FUTURE RESEARCH

APPENDICES

A DISPLAY ATTRIBUTE DESCRIPTION
B EYE TRACKING ATTRIBUTE DESCRIPTION
C CLASSIFICATION TREE RULESET FOR 1990 AND 1991
D REGRESSION TREE RULESET FOR 1990 AND 1991

REFERENCES

BIOGRAPHICAL SKETCH


Abstract  of  Dissertation  Presented  to 
the  Graduate  School  of  the  University  of  Florida 
in  Partial  Fulfillment  of  the  Requirements  for 
the  Degree  of  Doctor  of  Philosophy 

AUTOMATED  KNOWLEDGE  ACQUISITION 
VIA  INDUCTIVE  LEARNING  : 

AN  APPLICATION  TO 

A MARKETING  COMMUNICATIONS  EXPERT  SYSTEM 

By 

Chrysanthus  Shervanthie  de  Almeida 
August  1993 


Chairman:  Dr.  Gary  J.  Koehler 

Major  Department:  Decision  and  Information  Sciences 

Interest  in  the  use  of  expert  systems  in  various  business  domains  has  grown  very 
rapidly.  However,  the  progress  made  in  the  development  of  these  systems  has  been  slow 
because  of  difficulties  associated  with  knowledge  acquisition.  Inductive  learning  is  a 
promising  alternative  for  knowledge  acquisition  and  knowledge  base  refinement,  which 
will  greatly  simplify  the  process  of  developing  and  maintaining  knowledge  bases. 

In  this  dissertation  we  have  investigated  the  application  of  inductive  learning 
methodologies  using  decision  trees  in  the  context  of  the  influence  of  marketing 
communications  on  consumer  decision  making.  The  domain  problem  that  we  have 
addressed  is  a very  complex  and  unstructured  problem  concerning  the  design  of  sales 
catalogs.  Three  specific  relationships  were  of  interest,  namely,  the  influence  of  a catalog 
page design on directing the consumers’ attention to products displayed on that page, the
relationship between the amount of attention a product receives and its sales, and the
direct  relationship  between  the  catalog  page  design  and  sales.  This  is  a novel  problem, 
in  that  very  little  work  has  been  done  on  this  in  the  past,  and  the  domain  theory  is  not 
well  developed.  It  is  important  because  catalog  sales  are  part  of  a growing  industry  with 
increasing  investment,  and  any  knowledge  that  would  lead  to  improving  sales  and 
reducing costs would have tremendous value.

The  main  contributions  of  this  study  are  twofold.  First,  it  demonstrates  the 
relevance  of  machine  learning  in  solving  concrete  business  problems,  and  promotes  our 
knowledge  and  understanding  of  the  learning  behavior  of  rule  induction  methods  in 
complex  and  fuzzy  situations.  This  will  enhance  our  ability  to  adapt  and  use  the 
methodologies  in  different  problem  domains  under  similar  circumstances.  Second,  it  adds 
to  the  domain  knowledge  that  would  otherwise  be  difficult  to  acquire.  The  insightful 
relationships  extracted  from  a collection  of  data  and  represented  symbolically  will  aid 
immensely  in  managerial  decision-making. 




CHAPTER  1 
INTRODUCTION 

1.1  Background 

I think  this  introduction  is  an  appropriate  place  to  begin  explaining  the  meaning 
of  the  title  of  my  dissertation.  It  is  also  appropriate  to  provide  a very  brief  overview  of 
the  research  problem,  the  goals  of  the  research,  and  my  motivation  for  doing  this  research. 
Finally,  I should  follow  up  with  the  organization  of  the  chapters  of  this  dissertation  so 
that  any  interested  reader  can  easily  find  anything  specific  related  to  the  research. 
Throughout  the  dissertation,  I have  attempted  to  keep  the  discussion  simple,  focussed,  and 
logical.  I hope  that  the  reader  finds  the  scope  and  the  content  of  this  report  interesting, 
comprehensible,  and  insightful. 

An  expert  system  (ES),  as  we  know,  is  a computer  system  or  program  that  can 
emulate  human-like  capabilities  such  as  reasoning,  problem-solving,  etc.  at  the  level  of 
an  expert.  Expert  systems  belong  to  the  more  general  class  of  knowledge-based  systems 
(KBS),  where  the  level  of  performance,  though  human-like,  need  not  be  that  of  an  expert. 
Knowledge-based systems, including expert systems, differ from traditional artificial
intelligence (AI) programs in that the knowledge structure and the control structure are
separated. This separation allows the knowledge base to be developed independently
of the control structure.
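
The separation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system developed in this dissertation: the rule contents and the names RULES and forward_chain are hypothetical. The rules are held as plain data (the knowledge base), while a small forward-chaining loop plays the role of the control structure.

```python
# Knowledge base: plain data that can be edited without touching the engine.
# The rule contents are hypothetical and chosen only for illustration.
RULES = [
    ({"large_picture", "top_of_page"}, "high_attention"),
    ({"high_attention", "low_price"}, "likely_sale"),
]


def forward_chain(facts, rules):
    """Control structure: fire any rule whose conditions hold until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


if __name__ == "__main__":
    print(sorted(forward_chain({"large_picture", "top_of_page", "low_price"}, RULES)))
```

Adding or revising a rule changes only RULES; forward_chain is untouched, which is the sense in which the knowledge base can evolve independently of the control structure.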




A marketing  communication  is  a message  offering  products  and  services  that  is 
delivered  by  the  marketeer  to  potential  consumers  with  the  specific  intention  of  affecting 
the  purchasing  behavior  of  those  consumers.  The  effectiveness  of  the  communication 
depends on its design. A good design can create sufficient interest in the commodity
to facilitate the purchasing decision. Thus, well-designed communications can
augment  sales. 

The design is the creation of an expert designer. The knowledge and expertise of
the designer, however subjective, are brought to bear on the design. Yet, there is no
concrete  evidence  of  the  ways  in  which  these  designs  facilitate  the  decision  making 
process  of  the  consumer,  nor  are  there  objective  measures  to  evaluate  such  an  effect.  The 
kind  of  expertise  that  could  relate  the  design  features  to  the  sales  outcome  is  lacking. 
Thus,  the  task  of  a marketing  communications  expert  system  would  be  to  somehow 
emulate  such  dubious  expertise. 

The question, then, is: how can we capture and encapsulate this kind of
knowledge?  Knowledge  acquisition  is  the  process  by  which  knowledge  is  elicited  and 
represented  in  a form  that  can  be  manipulated  by  a computer.  In  our  case,  the  traditional 
knowledge  engineering  techniques  of  eliciting  knowledge  from  experts  via  interviews, 
protocol  analysis,  etc.,  are  immediately  ruled  out.  Semi-automated  techniques,  too,  where 
the  system  interactively  elicits  knowledge  from  the  expert,  are  ruled  out  by  the  same 
token.  The  only  viable  alternative  left  is  automated  knowledge  acquisition. 

Automated  knowledge  acquisition  is  a process  by  which  the  system  learns  or 
acquires knowledge automatically from the environment. Very little intervention, if any,
by the knowledge engineer or the expert is required. Learning methods may be deductive
or inductive. Deductive methods rely on some background
knowledge of the domain to facilitate the learning process. In a task domain where the
domain theory is not very well developed, deductive methods may not prove to be
feasible at all. Thus, our search for a suitable knowledge acquisition method is narrowed down
to just inductive learning.

Induction,  as  the  1979  Webster’s  New  Collegiate  Dictionary  describes,  is  the  act 
of  reasoning  from  a part  to  the  whole,  from  particulars  to  generals,  or  from  the  individual 
to  the  universal.  Inductive  learning  is  the  process  that  facilitates  this  action  in  computer 
systems.  Inductive  learning  can  be  used  in  our  domain  problem  provided  a sufficiently 
large set of pertinent observations or examples of previous behavior is available. The
"pertinence" issue is important because, in their present state, learning systems do not
perform very well when inundated with irrelevant information. The expert’s intervention
is required at this stage.

Thus, the title, "Automated Knowledge Acquisition via Inductive Learning: An
Application to a Marketing Communications Expert System," captures the
essence of the problem that we have addressed, both by design and by choice.

1.2  Research  Problem 

Selling merchandise via direct mail-order catalogs is a growing multi-billion-dollar
industry. Sales catalogs are designed to communicate information about each product,
including its appearance, cost, range of colors and sizes, other complementary and
complimentary products and accessories, and other pertinent information. The objective
of  the  design  is  to  influence  the  purchase  decision  of  the  consumer. 

There may be salient relationships among the display features of products, the
attention received by these products, and their sales outcomes. Knowing and
understanding these relationships is essential for creating effective designs and, thereby,
controlling attention and sales. Three types of relationships need to be modelled: first, the
relationship between a product’s display attributes and the amount of attention it receives;
second, the association between attention and the sales outcome of the product; and finally,
the direct relationship between the display attributes and the sales of products.

In attempting to model these relationships, we have investigated the use of
inductive learning via decision trees. The investigation has two steps. In the first step,
we cast the problem as a classification problem and use an appropriate classification
tree technique to solve it. In the second step, we cast the problem as a prediction problem
and use a regression tree technique. We then compare the performance of these
techniques against benchmark statistical techniques. We also analyze our results and
attempt to gain insights into the problem domain as well as the behavior of learning
systems in the presence of the idiosyncratic characteristics associated with real
problems of this type.
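
The difference between the two castings can be illustrated with a single split on one attribute. The sketch below is an assumption-laden toy, not the procedure used in this study: the data are invented, the function names are hypothetical, and the classification criterion is a plain misclassification count rather than the criterion of any particular tree algorithm. In the classification casting the sales figure is first discretized into classes; in the prediction casting it is kept as a number and each side of the split predicts its mean.

```python
from statistics import mean

# Hypothetical (display_size, sales) pairs, invented purely for illustration.
EXAMPLES = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (6.0, 30.0), (7.0, 34.0), (8.0, 32.0)]


def candidate_thresholds(values):
    """Midpoints between consecutive distinct attribute values."""
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def best_classification_split(data, sales_cutoff):
    """Classification casting: sales become 'high'/'low'; minimize misclassifications."""
    labelled = [(x, "high" if y >= sales_cutoff else "low") for x, y in data]

    def errors(threshold):
        total = 0
        left = [c for x, c in labelled if x <= threshold]
        right = [c for x, c in labelled if x > threshold]
        for side in (left, right):
            if side:
                majority = max(set(side), key=side.count)
                total += sum(1 for c in side if c != majority)
        return total

    return min(candidate_thresholds([x for x, _ in data]), key=errors)


def best_regression_split(data):
    """Prediction casting: sales stay numeric; minimize the sum of squared errors."""

    def squared_error(threshold):
        total = 0.0
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        for side in (left, right):
            if side:
                centre = mean(side)
                total += sum((y - centre) ** 2 for y in side)
        return total

    return min(candidate_thresholds([x for x, _ in data]), key=squared_error)


if __name__ == "__main__":
    print(best_classification_split(EXAMPLES, sales_cutoff=20.0))
    print(best_regression_split(EXAMPLES))
```

A full tree would apply such a split recursively to each resulting subset; only the first split is shown here.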

The  benefits  of  this  research  may  be  extended  to  other  similar  or  related  problem 
domains  such  as  television  commercials,  print  advertisements,  electronic  shopping,  etc. 
This problem is of interest because the domain theory is not well developed. In such
problems, data-driven search techniques, such as inductive learning, have been found to
be particularly effective.


1.3  Purpose 

We have adopted an empirical approach in this study. Beginning with a
problem to be solved, we have attempted to find, through scientific intuition and
experimentation,  a solution  using  new  AI  methodologies  of  learning.  In  doing  so  we  have 
gained  a better  understanding  of  the  processes  involved  in  learning  as  well  as  the  domain. 
The  major  goals  of  the  study  are  the  following: 

• Demonstrate  the  relevance  of  machine  learning  techniques  in  solving  concrete 
problems. 

• Investigate  the  concept  learning  behavior  based  on  rule  induction  using  decision 
trees. 

• Investigate  insightful  relationships  in  the  domain  that  would  contribute  towards 
supporting  or  refuting  existing  domain  theories  or  lead  to  new  theories. 

• Develop  and  validate  the  knowledge  base  that  would  capture  the  essence  of  the 
domain  problem  and  the  structure  implicit  in  the  data. 

1.4  Motivation 

Interest in the use of expert systems in various business domains has grown very
rapidly. However, the progress made in the development of these systems has been slow
because of difficulties associated with knowledge acquisition. Automated knowledge
acquisition in general and inductive learning in particular are promising approaches to
knowledge  acquisition  and  knowledge  base  refinement,  which  will  greatly  simplify  the 
process  of  developing  and  maintaining  knowledge  bases. 

Exploring  the  application  of  artificial  intelligence  methodologies  in  human 
problem  solving  tasks  will  prove  to  be  very  useful,  and  provide  a fertile  area  of  research 
for  strategic  and  competitive  use  of  artificial  intelligence  in  information  systems. 
Traditional statistical methods used in problem solving, such as discriminant analysis,
cluster analysis, and regression, rely heavily on quantitative measures of variables
determined a priori. Qualitative analysis and reasoning are left to the analyst or the
decision maker. The advantage of the artificial intelligence approach to problem solving
is that the qualitative component of human information processing and decision making
is implicit in the model. Furthermore, statistical methods rely on fundamental assumptions
about the underlying distribution of the description space, which are never quite satisfied in
real-world data. AI methodologies make no such assumptions and can thus be applied
without invalidating the model.


1.5  Chapter  Organization 

Chapter  two  reviews  the  literature  on  rule  induction  using  decision  trees.  The 
chapter  begins  with  a historical  overview  of  artificial  intelligence,  expert  systems,  and 
machine  learning  research,  followed  by  definitions  of  learning  and  some  formal  concepts. 
Then, decision tree induction processes, both for classification and prediction, and their
related issues are discussed. The chapter ends with a discussion of decision tree induction
applications  in  business. 

Chapter three focusses on the problem domain. It provides an overview of the
sales catalog design process, followed by an analysis and a specification of
the problem. Finally, some of the ensuing benefits of the research effort are listed.

Chapters four and five provide details of the experimental methodology, results,
and analysis. The fourth chapter is dedicated to classification trees using ID3, and the fifth
is dedicated to regression trees. These two chapters contain a comprehensive discussion
of the issues.

Finally, chapter six provides a summary and conclusion of the research, and
discusses potential future research.


CHAPTER  2 
LITERATURE  REVIEW 


2.1 Historical Overview

2.1.1  Directions  of  Artificial  Intelligence  Research 

The  directions  of  artificial  intelligence  (AI)  research  have  been  twofold.  One  has 
been  to  build  computer  models  that  help  in  understanding  the  nature  of  intelligent  activity, 
and  thus,  are  modelled  after  the  organization  of  the  human  mind  and  cognitive  processes. 
The  other  has  been  to  develop  computer  systems  that  demonstrate  certain  levels  of  human 
intelligence  in  solving  useful  problems,  and  thus,  are  not  particularly  concerned  about  the 
similarities  of  these  programs  to  human  mental  architecture.  While  the  two  reciprocate, 
in  that  the  knowledge  of  one  leads  to  a better  understanding  of  the  other,  our  interest  in 
this study lies with the latter because, typically, business applications of AI ought,
primarily, to deal with finding solutions that are as good as or even better than solutions
found by other means.

Knowledge  representation  and  search  were  the  two  most  fundamental  concerns  of 
early  AI  researchers.  The  former  addresses  the  problem  of  capturing  the  knowledge 
required  for  intelligent  behavior  in  a language  suitable  for  computer  manipulation.  The 
latter addresses the issues concerning the problem solving process, i.e. the search
techniques that systematically explore a space of problem states and manipulate the
knowledge resident in the system. The salient difference between conventional programs
and AI programs is that the former are algorithmic while the latter are heuristic. An
algorithmic  method  tends  to  search  for  the  best  solution  regardless  of  the  time  and  space 
complexity.  A heuristic  method  tends  to  find  a satisfactory  solution,  which  may  not 
always  be  the  best,  but  a solution  that  can  be  reached  economically  and  efficiently.  In 
certain  situations  the  cost  may  not  justify  the  best  solution,  and  therefore,  a heuristic 
solution  may  be  more  appropriate. 
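
The trade-off can be made concrete with a small routing problem. The sketch below is an illustration only, with invented city coordinates and hypothetical function names. The exhaustive method examines every possible tour and is guaranteed to return the shortest one, at a cost that grows factorially with the number of cities. The nearest-neighbour heuristic builds a single tour cheaply and may or may not find the shortest.

```python
from itertools import permutations
from math import dist

# Hypothetical city coordinates, invented for illustration.
CITIES = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 2)]


def tour_length(order, cities):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(dist(cities[a], cities[b]) for a, b in zip(order, order[1:] + order[:1]))


def exhaustive_tour(cities):
    """Algorithmic approach: try every ordering and keep the shortest tour."""
    rest = range(1, len(cities))
    tours = ([0] + list(p) for p in permutations(rest))
    return min(tours, key=lambda order: tour_length(order, cities))


def nearest_neighbour_tour(cities):
    """Heuristic approach: always move to the closest city not yet visited."""
    order = [0]
    unvisited = set(range(1, len(cities)))
    while unvisited:
        here = cities[order[-1]]
        nearest = min(unvisited, key=lambda c: dist(here, cities[c]))
        order.append(nearest)
        unvisited.remove(nearest)
    return order


if __name__ == "__main__":
    best = exhaustive_tour(CITIES)
    quick = nearest_neighbour_tour(CITIES)
    print(best, tour_length(best, CITIES))
    print(quick, tour_length(quick, CITIES))
```

With five cities the exhaustive search examines only 24 orderings, but with twenty it would have to examine more than 10^17, while the heuristic still needs only a few hundred distance comparisons.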

2.1.2 Early General Problem Solving Systems

Until the 1970s, AI researchers were mainly interested in programs that were
general problem solvers. In these systems, the problem solving heuristics and the control
structure were intertwined. The earliest AI programs were attempts at solving puzzles or
common board game problems. Instead of carrying out an exhaustive search of the entire
problem space, as was done in conventional programs, these programs used general
problem solving heuristics or a set of well-defined rules to generate the search space.
These heuristics determined which alternatives were to be explored in the search space.
According  to  Newell  and  Simon  (1976),  human  problem  solving  involves  searching  the 
problem  space  for  alternative  solutions  and  then  searching  the  solution  space  for  the  best 
alternative. 

Among  the  earliest  AI  systems  were  the  ones  that  could  perform  automated  reasoning 
or  theorem  proving.  The  theory  of  problem  solving  developed  by  Allen  Newell  and 
Herbert Simon in 1956 resulted in Logic Theorist (LT) (Newell and Simon, 1963a). The
problem with LT was that, unlike humans, it worked backwards from theorem to axioms.
The theory was then revised and implemented in the program General Problem Solver
(GPS), which used means-ends analysis as its control process (Newell and Simon, 1963b).
These  programs  were  designed  to  derive  theorems  from  basic  axioms  using  formal 
mathematical  logic.  Theorem  proving  can  be  used  in  a wide  variety  of  situations  by 
representing  the  problem  as  a set  of  logical  axioms  and  treating  problem  instances  as 
theorems  to  be  proved. 

A major drawback of these theorem proving programs was that they lacked powerful
heuristics,  and  were  found  to  perform  inconsistently  and  inefficiently  when  solving 
complex  problems.  They  tended  to  prove  many  irrelevant  theorems  before  they  could 
prove  the  correct  one. 

2.1.3  Advent  of  Expert  Systems 

Lessons  learned  from  the  failures  of  early  problem  solving  systems  in  solving 
complex  problems  resulted  in  recognizing  the  importance  of  the  domain-specific 
knowledge  over  general  problem  solving  skill.  This  led  to  the  development  of  a class  of 
computer programs in the 1970s, which came to be known as "expert systems" (ES).
These addressed much broader and deeper issues related to complex domain-specific
problem solving knowledge used by human experts. An expert system relies on the
knowledge of a human expert for its problem solving strategies. The salient feature of
knowledge-based systems (KBS) is the separation of the knowledge and the control
structure. Thus, for the first time, it became possible for the knowledge to evolve
independently of the control structure. Cohen and Feigenbaum (1982) note that the power
of  an  AI  program  is  directly  proportional  to  what  it  knows. 

Among  the  landmark  expert  systems  that  were  developed  initially  are  DENDRAL 
(Lindsay et al., 1980) and MYCIN (Buchanan and Shortliffe, 1984). DENDRAL,
developed in the late 1960s, is one of the earliest known systems that used domain-specific
knowledge in problem solving. It emphasized the power of specialized knowledge
over  generalized  problem  solving  methods.  DENDRAL  was  designed  to  infer  the  identity 
of  chemical  structures  from  chemical  formulas  and  mass  spectrographic  information  about 
chemical  bonds.  DENDRAL  used  the  heuristic  knowledge  of  an  expert  chemist  to  search 
a large  space  of  chemical  structures.  One  of  the  major  concerns  in  DENDRAL  was  the 
representation  of  specialized  knowledge,  which  led  to  the  discovery  of  production  rules 
as  a powerful  form  of  representation. 

MYCIN,  according  to  Newell  (Buchanan  and  Shortliffe,  1984),  is  the  original 
expert  system  that  made  it  evident  to  all  the  rest  of  the  world  that  a new  niche  had 
opened  up.  MYCIN  was  designed  to  solve  the  problem  of  diagnosing  and  recommending 
treatment  for  meningitis  and  bacteremia.  The  main  contributions  of  MYCIN  were,  firstly, 
reasoning  with  uncertain  and  incomplete  information,  and  secondly,  providing  a clear  and 
logical  explanation  of  the  reasoning  process. 

Other well known expert systems include PROSPECTOR (Duda et al., 1979)
for determining the probable location and type of ore deposits based on geological
information, INTERNIST for performing diagnosis in the area of internal medicine,
XCON for configuring VAX computers, and MACSYMA for performing symbolic
integration of mathematical functions.

This  breakthrough  not  only  proved  AI  to  be  a viable  discipline  with  an  extremely 
high  potential  for  practical  applicability,  but  also  spurred  the  interest  of  business  and 
industry.  With  the  emergence  of  expert  systems  as  the  most  successful  and  profitable  area 
of  artificial  intelligence,  the  field  of  AI  underwent  a significant  transformation.  It  began 
to  flourish  outside  research  laboratories  in  real  world  environments. 

2.1.4  Learning  : An  Essential  Characteristic  of  Intelligent  Systems 

A major  shortcoming  of  these  knowledge-based  systems  is  their  inability  to  learn 
from  experience.  Once  a knowledge  base  is  constructed  it  remains  static  unless  a 
knowledge  engineer  intervenes  to  modify  it.  The  knowledge  gained  by  solving  problems 
is  never  incorporated  into  the  knowledge  base  to  be  used  in  subsequent  problem  solving 
tasks  as  would  be  the  case  with  a human  expert. 

Long  before  the  advent  of  expert  systems,  the  AI  community  had  begun  to  regard 
learning  as  a hallmark  of  intelligent  behavior.  AI  researchers  sought  to  understand  the 
process of learning and to create computer programs that can learn. Learning research,
like its parent AI research, has dual purposes: one, to understand the human learning
process, and two, to provide computer systems with the ability to learn.

Not  only  is  learning  an  important  element  of  intelligent  behavior,  it  is  also  a 
difficult  problem  for  AI  programs,  and  hence,  machine  learning  has  been  recognized  as 
a central research discipline within the AI community since the mid-1950s. With the
development of expert systems in the 1970s, learning became even more important because
of the problems associated with traditional knowledge acquisition processes. Machine
learning offers a viable way around what Feigenbaum (1981) refers to as the
"bottleneck" problem of knowledge engineering.

2.1.5  Evolution  of  Learning  Research 

Machine  learning  research  has  been  concerned  with  developing  programs  able  to 
construct new knowledge or refine existing knowledge. Cohen and Feigenbaum (1982)
describe the evolution of learning research in three stages. The first stage centered on
self-organizing systems that modified themselves to adapt to their environments. Among
the self-organizing systems, the Perceptron (Rosenblatt, 1957) was developed as a
computational analogue of neurons. But interest soon died because of the theoretical
limitations shown by Minsky and Papert (1969) regarding the inability of the Perceptron
to effectively learn concepts that are not linearly separable. These early efforts failed to produce
systems of any complexity or intelligence. Interest in Artificial Neural Networks (ANN)
resurged  in  the  past  decade  as  connectionist  networks  with  hidden  units  able  to  compute 
and  learn  non-linear  functions  were  developed  (Rumelhart  and  McClelland,  1986). 
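
The linear separability limitation is easy to reproduce. The sketch below is a minimal illustration with hypothetical function names, not a reconstruction of Rosenblatt's original system: a single threshold unit is trained with the classic perceptron update rule on two Boolean concepts. AND is linearly separable and is learned exactly. XOR is not, so no setting of the two weights and the bias can classify all four cases correctly, however long the unit is trained.

```python
def predict(weights, x1, x2):
    """Output of a single threshold unit with two inputs and a bias."""
    w1, w2, bias = weights
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def train_perceptron(samples, epochs=50):
    """Classic perceptron rule: adjust the weights by the error on each example."""
    w1 = w2 = bias = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - predict((w1, w2, bias), x1, x2)
            w1 += error * x1
            w2 += error * x2
            bias += error
    return w1, w2, bias


def accuracy(weights, samples):
    """Fraction of examples the trained unit classifies correctly."""
    hits = sum(predict(weights, x1, x2) == target for (x1, x2), target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

if __name__ == "__main__":
    print("AND accuracy:", accuracy(train_perceptron(AND), AND))
    print("XOR accuracy:", accuracy(train_perceptron(XOR), XOR))
```

Networks with hidden units overcome this limitation because the hidden layer can re-represent the inputs in a space where the concept becomes separable.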

In the 1970s, interest in learning research was renewed. The common view
among learning researchers, that a learning system cannot be expected to learn high level
concepts by starting without any knowledge at all, led either to the study of simple
learning problems or to the incorporation of large knowledge bases into learning systems. Among
the learning programs that incorporated a large amount of knowledge are the Automated
Mathematician, or AM (Lenat, 1977, 1982), designed to discover mathematical laws
from fundamental concepts and axioms in set theory, and Meta-DENDRAL (Buchanan
and Mitchell, 1978), which learns rules for interpreting mass spectrographic data in organic
chemistry from examples of data on compounds of known structure.

Finally,  the  current  learning  research  driven  by  the  need  to  acquire  knowledge  for 
expert  systems  is  looking  at  all  forms  of  learning,  i.e.  rote  learning,  learning  by  being 
told,  learning  from  examples,  learning  by  analogy,  and  their  extensions  and  combinations. 
Of  the  different  forms  of  learning,  learning  from  examples  is  the  area  that  has  been 
studied the most. Several methodologies have been developed within this paradigm.
Among the more promising are ID3 (Quinlan, 1979), AQ11 (Michalski, 1983), CART
(Breiman et al., 1984), Artificial Neural Networks (ANN) (Rumelhart and McClelland, 1986),
and Genetic Algorithms (GA) (Holland, 1986).

In the past, most research in machine learning has had an empirical focus
(Michalski, 1983, 1988; Fisher, 1990; Langley, 1989; Quinlan, 1979, 1986, 1987;
Mingers, 1989a, 1989b), and it is only recently that progress has been made on the
theoretical front (Valiant, 1983). Still, most algorithms remain too complex for formal
analysis,  and  hence  machine  learning  research  will  continue  to  have  a significant 
empirical  component  in  the  foreseeable  future. 

Though  machine  learning  has  begun  to  pay  off  in  various  ways,  very  little  work 
has  been  done  in  studying  the  applicability  of  these  methodologies  for  constructing 
knowledge  bases  in  the  business  domain.  Few  studies  have  demonstrated  the  relevance 
of learning in finding good solutions for classification and prediction problems (Messier
and Hansen, 1988; Shaw and Gentry, 1988; Braun and Chandler, 1989; Currim et al.,
1988; Tam, 1991). The knowledge acquisition bottleneck has created a very critical need
for  applying  these  learning  methodologies  to  problems  of  varying  complexities  in  different 
domains  and  solving  them,  and  also  for  understanding  the  behavior  of  the  systems  in 
these  domains. 


2.2  Inductive  Learning 


2.2.1  Learning  Defined 

Learning  has  been  defined  in  many  ways.  Simon  (1983)  defined  learning  as  any 
change  in  a system  that  allows  it  to  perform  better  on  repetition  of  the  same  task  or  on 
another  task  drawn  from  the  same  population.  Essentially,  this  definition  refers  to  the 
improvement  of  a system’s  performance  with  task  repetition.  Alternatively,  the  expert 
systems  community  commonly  views  learning  as  the  acquisition  of  explicit  knowledge. 
Other  definitions  refer  to  learning  as  the  process  of  skills  acquisition  or  as  the  process  of 
theory  formation,  hypothesis  formation,  and  inductive  inference  (Cohen  and  Feigenbaum, 
1982). 

A learning system performs learning by extracting knowledge from the
environment and representing it in a form that can be used by an inference mechanism.
Based on the level at which knowledge is available in the environment and the level at
which knowledge is used by the inference mechanism, four generally accepted forms of
learning (i.e. rote learning, learning by being told, learning by analogy, and learning
from  examples)  have  been  described  in  the  literature.  Most  of  the  current  research  is 
directed  towards  learning  from  examples  or  inductive  learning. 

Inductive  learning  is  defined  as  a process  of  acquiring  knowledge  by  drawing 
inductive  inferences  from  teacher-  or  environment-provided  facts  (Michalski,  1983). 
Inductive  learning  is  viewed  by  Mitchell  (1982)  as  learning  generalizations,  i.e.  to  take 
into  account  a large  number  of  specific  observations,  and  then  extract  and  retain  important 
common features that characterize classes of these observations. Inductive learning or
concept  learning  can  also  be  viewed  as  the  extrapolation  over  the  domain  of  interest,  of 
an  unknown  set,  from  a given  collection  of  observations.  As  seen  in  these  definitions,  the 
underlying  assumption  of  inductive  learning  is  that  the  learning  system  is  provided  with 
a pre-classified  or  tutored  set  of  observations  as  input,  which  represents  experiential 
knowledge. 

In recent literature (Valiant, 1984; Hausler, 1988; Natarajan, 1991), a
probabilistic notion of learning has been propounded. Given some examples of an
unknown  target  concept  and  some  prior  information  on  it,  learning  is  defined  as  the  task 
of  computing  a good  approximation  of  this  concept.  Prior  information,  as  suggested  by 
domain  theory,  reduces  the  number  of  observations  needed  for  a good  approximation.  This 
definition  does  not  assume  that  a perfectly  complete  and  consistent  concept  can  always 
be learned, i.e. the learned concept need not cover all members of the concept nor
disallow all non-members.

2.2.2  Taxonomy  of  Inductive  Learning  Methods 

Inductive  learning  methods  can  be  classified  into  two  categories  based  on  the 
knowledge  structures  used  to  represent  the  acquired  knowledge.  The  symbolic  methods, 
to  which  the  AQ  family  (Michalski,  1983)  and  the  family  of  Concept  Learning  Systems 
(CLS)  or  ID3-types  (Quinlan,  1979)  belong,  represent  their  knowledge  in  a form  that  is 
easily  understood  by  human  beings,  such  as  logic  constructs  or  decision  trees.  The 
subsymbolic  methods,  to  which  the  artificial  neural  networks  (ANN)  or  connectionist 
networks  (Rumelhart  and  McClelland,  1986)  and  the  genetic  algorithms  (GA)  (Holland, 
1986)  belong,  represent  their  knowledge  in  a manner  that  is  not  easily  understood,  such 
as  numerical  weights  and  binary  strings. 

The symbolic methods satisfy the comprehensibility criterion proposed by Michalski
(1983) for judging symbolic learning methods. According to this criterion, the knowledge
created  by  learning  programs  should  be  easy  for  humans  to  interpret  and  understand,  and 
the  concepts  should  directly  correspond  to  those  used  by  humans.  This  is  important, 
especially,  in  applications  such  as  expert  systems  where  the  knowledge  needs  to  be 
understood,  interpreted,  and  explained. 

Though subsymbolic methods may be good as problem-solving tools, they are not,
in their present state, very good as knowledge acquisition tools for expert systems,
because expert systems require knowledge that can be explained. Further work has to
be done to map the knowledge acquired by subsymbolic methods to symbolic
representations before it can be used in traditional expert systems.

Knowledge  acquired  by  subsymbolic  methods  cannot  be  shared  because 
subsymbolic  methods  do  not  satisfy  the  knowledge  representation  hypothesis  (Reichgelt, 
1991).  According  to  this  hypothesis,  an  intelligent  system  is  assumed  to  contain  as  a 
substructure  a knowledge  base,  which  is  more  or  less  a direct  encoding  of  the  knowledge. 
This  knowledge  base  is  manipulated  by  a separate  substructure,  which  is  often  called  the 
inference engine. Connectionist systems reject the notion of separate substructures for
representing and manipulating knowledge. Such knowledge is far more difficult for
some other part of the performance system to interpret or use.

Another  dimension  along  which  inductive  learning  methods  are  differentiated  is 
the  mode  of  learning  (i.e.  whether  learning  is  incremental  or  not).  Non-incremental 
methods  require  all  the  observations  necessary  to  learn  a concept  before  the  learning 
begins.  If  new  evidence  is  found  that  suggests  the  concept  to  be  incorrect  or  inaccurate, 
this  new  knowledge  cannot  be  accommodated  incrementally  without  reconstructing  the 
entire  knowledge  base.  Thus,  past  experience  has  to  be  retained  in  its  disaggregated  form 
as  examples.  Additional  computational  effort  is  also  required  to  reconstruct  the  knowledge 
base from scratch. Incremental methods (Utgoff, 1989), on the other hand, do not have
this problem. New knowledge can be incorporated into existing knowledge based only on
the new evidence. However, incremental methods may require so much more computational
effort to exercise this capability that the costs might far outweigh the benefits
of incremental learning.

There  are  also  the  special  purpose  inductive  systems  such  as  meta-DENDRAL  and 
the  general  purpose  systems  such  as  ID3,  INDUCE,  etc.  Meta-DENDRAL  is  designed  for 
the  specific  task  of  discovering  rules  by  which  molecular  structures  could  be  deduced.  It 
uses  a weak  model  of  the  domain  to  search  for  a stronger  model.  General  purpose  systems 
do  not  rely  on  any  domain  knowledge  for  inferring  new  knowledge  but  purely  on  a 
general  heuristic. 

Inductive  learning  systems  are  also  categorized  as  supervised  and  unsupervised 
learning.  Thus  far,  all  our  definitions  and  characterizations  referred  to  supervised  learning, 
i.e.  learning  from  preclassified  examples.  In  unsupervised  learning  or  learning  from 
observations,  the  input  data  in  the  training  set  are  unclassified  observations.  In  contrast 
to  concept  acquisition  in  supervised  learning,  the  task  of  unsupervised  learning  is  concept 
formation  or  empirical  discovery  (Michalski  & Stepp,  1983;  Falkenhainer  & Michalski, 
1986;  Fisher,  1987;  Lebowitz,  1987;  Gennari  et  al.,  1989;  Langley  & Zytkow,  1989; 
Fisher  & Langley,  1990).  Empirical  discovery  is  a first  step  towards  theory  formation 
(Kelly,  1990;  Falkenhainer  & Rajamoney,  1990). 

The  goal  of  unsupervised  learning  is  similar  to  statistical  clustering,  but  the 
approach  is  different.  Instead  of  numerical  measures  of  similarity  that  are  used  in 
clustering,  qualitative  measures  are  used  in  unsupervised  learning. 

2.2.3 Decision Trees

One  of  the  best  known  approaches  to  inductive  learning  involves  construction  of 
decision  trees  to  represent  concepts.  In  order  to  do  this,  the  decision  trees  must  capture 
some  meaningful  relationship  between  an  object’s  class  and  its  attribute  values.  Among 
the  first  methods  for  developing  such  decision  trees  was  the  Concept  Learning  System 
(CLS)  developed  by  Hunt  in  1962.  Later,  Quinlan  (1979)  developed  the  Iterative 
Dichotomizer  3 (ID3)  system.  The  ID3  algorithm  has  undergone  several  revisions,  each 
overcoming  some  of  the  shortcomings  of  the  previous  version,  and  the  newer  version  is 
known  as  C4.5. 

Decision  trees  are  relatively  simple  and  economical  formalisms  that  are  used  to 
represent  acquired  knowledge  or  decision  procedures  for  determining  the  class  of  a given 
instance or for predicting the value of an unknown attribute. Their computation time
increases only linearly with the size of the training set, the number of attributes, and the
size (number of nodes) of the tree. However, they lack the expressive power of some other
representation formalisms such as semantic networks, frames, and first-order logic.

Decision trees can be divided into two main types: classification trees and
regression trees. The difference between the two is that the former requires the dependent
variable of the input examples to consist of sharply delineated categories, while the latter
requires it to be an ordered response variable. The former assigns a class to the unseen
cases while the latter predicts the value of the response. However, both can accept ordered
and unordered independent attributes.

2.3 Classification Trees


2.3.1 Formal Concepts

Michalski  and  Kodratoff  (1990)  describe  three  basic  assumptions  underlying 
inductive  learning.  They  are  as  follows:  1.)  Generalization  is  the  basic  process  of 
knowledge  acquisition;  2.)  A concept  coincides  with  its  description;  and  3.)  Good  concept 
descriptions  are  simple  and  effective.  Furthermore,  a concept  is  generally  described  by 
three  aspects.  These  are:  1.)  The  name  or  the  label  given  to  the  concept,  e.g.  high 
attention, high sales; 2.) The intensional definition or description, which is a relation
between  the  name  and  a given  set  of  features  that  helps  in  distinguishing  instances  of  that 
concept,  e.g.  if  a display  is  large,  on  the  left-hand  page,  and  placed  at  the  center,  then  the 
product  is  in  the  "high  sales"  category;  and  3.)  The  extension,  i.e.,  the  set  of  all  instances 
or  observations  that  belong  to  the  concept,  e.g.  all  possible  displays  of  products  that 
produce  high  sales. 

Each instance of a concept is described by a fixed set of $k$ attributes $a_1, a_2, \ldots, a_k$,
having observable value sets $V_1, V_2, \ldots, V_k$, where each $V_i$ is assumed to be finite. The
attributes are of two types: 1.) Nominal attributes that take on a finite, unordered set of
mutually exclusive values, e.g. position, swatch type, etc.; and 2.) Linear attributes that
take on a linearly ordered set of mutually exclusive real or integer values, e.g. display size,
number of swatches, etc.

The description of an instance $I$ is of the form

$$I = (a_1 = v_{1r},\ a_2 = v_{2s},\ \ldots,\ a_k = v_{kt})$$

where $v_{ij} \in V_i$.

The instance space defined by the above set of attributes is the cross-product of
their corresponding value sets $V_1, V_2, \ldots, V_k$. The instance space can be considered as
a $k$-dimensional space consisting of a large set of observations, independently of the
concepts or classes to which they belong. The instance space is typically so large that a
realistic database of observations will only be a small fraction of the entire instance space.

Concepts  can  be  specified  on  this  instance  space  that  will  partition  the  entire 
instance  space  into  sets  of  objects  that  belong  to  the  concept  and  those  that  do  not.  What 
a learning  system  does  is,  in  fact,  learn  the  concept  descriptions  from  a very  small  subset 
of  instances  that  are  pre-classified.  A pre-classified  instance  is  a positive  example  of  the 
concept  if  it  belongs  to  the  concept  or  a negative  example  if  it  does  not  belong  to  the 
concept. 

For an instance space $\mathcal{I}$, the number of distinct subsets or concepts over $\mathcal{I}$ is
$2^{|\mathcal{I}|}$. In order to find the target concept or a good approximation of it, the learning
algorithm  has  to  effectively  search  this  concept  space.  The  effectiveness  of  the  search 
depends  on  how  good  a heuristic  is  used  by  the  learning  algorithm. 
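
The size of the search problem can be illustrated with a small sketch. The attribute names and value sets below are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
from itertools import product

# Hypothetical value sets for three attributes (illustrative names only).
value_sets = {
    "position": ["left", "center", "right"],
    "swatch_type": ["fabric", "paper"],
    "display_size": [1, 2, 3, 4],
}

# The instance space is the cross-product of the value sets.
instance_space = list(product(*value_sets.values()))
num_instances = len(instance_space)   # 3 * 2 * 4 instances

# Every subset of the instance space is a candidate concept.
num_concepts = 2 ** num_instances

print(num_instances, num_concepts)
```

Even with three small attributes the concept space runs to millions of candidate concepts, which is why an exhaustive search is impractical and a heuristic is needed.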

The  learning  system  receives  examples  of  the  concepts  as  input.  An  example  is 
an  observation  from  the  domain  of  interest  together  with  a label  indicating  whether  the 
observation  is  a member  of  the  set  representing  the  concept. 

An example $X$ is of the form

$$X = (I, c_i)$$

where $c_i \in C$ is the $i$th class of the set of all classes $C$ and $I$ is an instance
belonging to $c_i$.

The  language  used  to  describe  the  concept  or  rule  determines  the  hypothesis  space 
used  by  the  learning  algorithm.  By  choosing  an  appropriate  language,  the  hypothesis  space 
can  be  made  smaller  than  the  concept  space,  thus  reducing  the  search  space.  Several 
alternative  forms  of  concept  descriptions  are  possible.  A concept  may  consist  of  a set  of 
conjuncts  corresponding  to  the  set  of  attributes,  each  conjunct  defining  the  possible  range 
of  values  of  the  corresponding  attribute.  Some  conjuncts  may  be  void  if  the  corresponding 
attribute  is  not  discriminatory  toward  the  concept.  The  concept  needs  to  contain  only 
terms  that  have  significant  attributes.  Some  concepts  may  be  simple  while  others  are 
complex.  Concepts  may  become  even  more  complex  by  taking  the  form  of  a set  of 
disjuncts  of  terms,  with  each  term  being  a set  of  conjuncts. 

A term or atom $E_i$ is a Boolean-valued expression of the form

$$E_i = (a_i = v_{ir})$$

and a pure conjunctive concept $F$ is of the form

$$F = E_1 \wedge E_2 \wedge \cdots \wedge E_m$$

A pure disjunctive concept $G$ is of the form

$$G = E_1 \vee E_2 \vee \cdots \vee E_m$$

and a disjunctive concept with internal conjunctions (a K-DNF expression, where any
$m < K$) is of the form

$$H = F_1 \vee F_2 \vee \cdots \vee F_n$$
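
The concept forms above can be sketched as a simple data structure. In this illustrative sketch a term is an (attribute, value) pair, a conjunctive concept is a list of terms, and a disjunctive concept with internal conjunctions is a list of conjuncts. The function names and the example concept are assumptions made for illustration.

```python
def satisfies_term(instance, term):
    """E = (a = v): true when the instance has value v for attribute a."""
    attribute, value = term
    return instance.get(attribute) == value

def satisfies_conjunct(instance, conjunct):
    """F = E1 AND E2 AND ... AND Em."""
    return all(satisfies_term(instance, t) for t in conjunct)

def satisfies_dnf(instance, dnf):
    """H = F1 OR F2 OR ... OR Fn."""
    return any(satisfies_conjunct(instance, f) for f in dnf)

# Hypothetical "high sales" concept with two conjuncts.
high_sales = [
    [("size", "large"), ("page", "left"), ("position", "center")],
    [("size", "large"), ("swatches", "many")],
]

display = {"size": "large", "page": "right",
           "position": "top", "swatches": "many"}
print(satisfies_dnf(display, high_sales))
```

An attribute that does not appear in a conjunct corresponds to a void conjunct in the sense used above: it places no restriction on the instance.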

Bias is defined as the set of all factors that collectively influence hypothesis
selection (Hausler, 1988). These factors include the language used to describe the
hypotheses and the heuristics used by the learning algorithm. Bias essentially makes the
learning task easier by reducing the search needed to choose a particular
hypothesis. However, bias introduces the risk of not finding the true concept within the
search space. The stronger the bias, the easier the learning task, and the higher the risk
of not finding the true concept. For example, using a feature space of two attributes, say,
"size" and "position", to describe the examples, and having only disjunctive and
conjunctive logical operators to construct concepts from these features, reduces the search
space considerably. However, it also increases the likelihood of not being able to describe
a concept with the available vocabulary of this language, and thus of not finding the concept.
The search heuristic (e.g. the entropy measure) determines the effectiveness of the search.

Ideally,  a learning  algorithm  should  learn  concepts  or  rules  that  would  correctly 
classify  all  the  unseen  cases.  Practically,  this  is  a very  difficult  goal  to  achieve  because 
of  many  factors.  Firstly,  the  nature  and  complexity  of  the  instance  space  might  preclude 
learning  a concept  without  using  the  entire  set  of  instances.  Secondly,  the  training  sample 
used  may  not  be  representative  of  the  domain  concept.  Thirdly,  the  descriptive  language 
used to describe examples may be so inadequate that only a partial concept can be learned.
Fourthly, the sample may be so riddled with noise, both classification and attribute, that only
an approximate or incorrect concept is learned.

The concept or rule that is learned is a function $\mathcal{L}$ defined on the instance space
that maps any unseen instance $x$ to a class:

$$\mathcal{L}(x) = c_j \quad \text{if } x \in c_j$$

and

$$\mathcal{L}(x) = \neg c_j \quad \text{if } x \notin c_j$$

The choice of attributes used for the description of the instances is a difficult but
essential preprocessing task. An inappropriate choice of attributes leads to complex decision
rules, whereas pertinent choices result in simple and comprehensible rules. Also, the wrong
choice of attributes may render the attributes inadequate to learn the true concept. Domain
theory is helpful in determining the relevant attributes.

2.3.2  Construction  of  Classification  Trees 

The  ID3  accepts  as  input  a set  of  preclassified  training  examples  of  the  concept 
to  be  learned  and  iteratively  constructs  a decision  tree  that  describes  the  concept.  The 
fundamental  approach  is  to  select  the  attribute  that  "best"  divides  the  examples  into  their 
classes  and  partition  them  according  to  the  attribute  values,  and  then  recursively  repeat 
the  procedure  for  each  partition.  Each  internal  node  of  the  decision  tree,  including  the  root 
node,  is  a test  node  and  each  leaf  node  is  a class  node.  At  every  test  node,  each  attribute 
is  evaluated  on  its  ability  to  partition  the  set  of  training  examples  based  on  some  measure 
of  goodness-of-split  (e.g.  entropy,  gain-ratio,  chi-square),  and  the  attribute  that  has  the 
highest  measure  is  selected.  For  each  value  of  the  selected  attribute,  a child  node  is 
created.  The  examples  belonging  to  the  parent  node  are  then  partitioned  according  to  the 
different  values  of  the  selected  attribute,  and  placed  in  each  of  the  corresponding  child 
nodes. The process is recursively repeated for each child node until either all the
examples at a node are of the same class, or some stopping criterion (e.g. the number of
examples at a node is below a statistically significant threshold) is satisfied.

The tree induction can be formally represented as follows: Let $S$ be the set of
examples. If $S$ contains examples of only one class, then the decision tree is simply a
node with that class assigned to it. Otherwise, let $T$ be any test on an example with
possible outcomes $c_1, c_2, \ldots, c_w$. Each example in $S$ will give one of these outcomes for $T$,
so $T$ partitions $S$ into $\{S_1, S_2, \ldots, S_w\}$ with $S_i$ containing examples with the outcome $c_i$.
Each $S_i$ is, in turn, subjected to the same procedure.
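
The recursive procedure can be sketched as follows. This is a minimal illustration rather than the ID3 program itself: it assumes nominal attributes only, uses information gain as the goodness-of-split measure, and stops when a node is pure or no attributes remain (the latter stopping rule is an assumption of this sketch). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Information (in bits) needed to identify the class of one example."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Reduction in entropy obtained by partitioning on `attribute`."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {inst[attribute] for inst, _ in examples}:
        subset = [label for inst, label in examples if inst[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attributes):
    """Return a class label (leaf) or an (attribute, {value: subtree}) test node."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or no attribute left: use the most numerous class.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {inst[best] for inst, _ in examples}:
        subset = [(inst, label) for inst, label in examples
                  if inst[best] == value]
        children[value] = build_tree(subset, remaining)
    return (best, children)

def classify(tree, instance):
    """Follow the tests from the root until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[instance[attribute]]
    return tree

examples = [
    ({"size": "large", "position": "center"}, "high"),
    ({"size": "large", "position": "edge"}, "high"),
    ({"size": "small", "position": "center"}, "low"),
    ({"size": "small", "position": "edge"}, "low"),
]
tree = build_tree(examples, ["size", "position"])
print(tree)
```

In this toy set only "size" separates the classes, so the sketch tests "size" at the root and never uses "position". Note that `classify` raises a `KeyError` for an attribute value that was not seen at a test node during training.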

The  description  of  the  concept  corresponding  to  any  leaf  node  is  the  conjunction 
of  the  attribute  values  along  the  path  from  the  root  to  that  leaf.  The  number  of  attributes 
along  the  path  may  be  few  relative  to  the  total  number  of  attributes  used  in  the  example 
set,  and  thus,  the  concept  could  be  relatively  simple.  If  more  than  one  leaf  is  of  the  same 
class,  then  the  concept  description  is  a disjunction  of  such  internally  conjunctive  terms 
that  correspond  to  the  path  of  each  leaf. 

2.3.3 Issues Related to Classification Tree Construction

Goodness-of-split.  Several  methods  of  evaluating  the  goodness-of-split  have  been 
suggested  in  the  literature.  Entropy  (Quinlan,  1986)  is  an  information  theoretic  measure 
that  evaluates  an  attribute  for  its  ability  to  reduce  the  entropy  or  to  produce  the  highest 
information  gain  for  the  set.  The  gain  criterion  has  been  shown  to  prefer  attributes  with 
many values (Hart, 1984; Quinlan, 1986; Mingers, 1986). An alternative suggested
by Quinlan is the gain-ratio criterion, which includes a factor that takes into
consideration the multiple values of an attribute. This is reported to have improved the
performance of the resulting tree in terms of size and predictive accuracy in situations
where multiple valued attributes are augmented by redundant or noisy attributes.

Chi-square ($\chi^2$) has been suggested as an alternative measure (Hart, 1984).
Chi-square is not susceptible to the same multi-value bias as the entropy measure. Another
advantage in using $\chi^2$ is that the selection of insignificant attributes can be prevented, thus
reducing the tree complexity. Mingers (1987) has reported experiments where entropy
produced smaller trees than $\chi^2$, and that the trees produced in both cases were similar at
the first few levels.

For the information gain method, a decision tree may be regarded as an
information source that, given an instance, generates a message (class $c_i$ of the instance)
from $m$ possible messages (classes). If the classes are not equally likely, then let $p_i$ be the
probability of $c_i$. The measure of information content in that message is given by
$-\sum_i p_i \log(p_i)$.

To  compute  an  attribute’s  measure  of  information  gain  at  any  given  node, 
contingency  tables  can  be  used.  The  following  illustrates  such  a contingency  table: 


Table 2.1 Contingency Table

                                     Class
Attribute values   $c_1$      $c_2$      ...   $c_k$      Row sum
$v_1$              $n_{11}$   $n_{12}$   ...   $n_{1k}$   $n_{1*}$
$v_2$              $n_{21}$   $n_{22}$   ...   $n_{2k}$   $n_{2*}$
:                  :          :                :          :
$v_l$              $n_{l1}$   $n_{l2}$   ...   $n_{lk}$   $n_{l*}$
Column sum         $n_{*1}$   $n_{*2}$   ...   $n_{*k}$   $N$

where $c_j$ - $j$th class
$v_i$ - $i$th attribute value of the test attribute
$n_{ij}$ - number of examples having value $v_i$ for the test attribute and
belonging to class $c_j$
$n_{i*}$ - row sum of $i$th row
$n_{*j}$ - column sum of $j$th column
$N$ - total number of examples at the given node.


The information needed to classify the examples without prior knowledge of the
distribution on the attribute is

$$M(C) = -\sum_j \frac{n_{*j}}{N} \log\frac{n_{*j}}{N}$$

If the distribution of examples on the attribute values is known, then for any given
value $v_i$ of the attribute, the necessary information would be

$$M(C \mid v_i) = -\sum_j \frac{n_{ij}}{n_{i*}} \log\frac{n_{ij}}{n_{i*}}$$

The necessary information for the attribute (considering all values) is the weighted
average of the information for each value:

$$M(C \mid A) = -\sum_i \frac{n_{i*}}{N} \sum_j \frac{n_{ij}}{n_{i*}} \log\frac{n_{ij}}{n_{i*}}$$

Thus, the gain in information, given the knowledge of the attribute, is

Information gain $IG(A) = M(C) - M(C \mid A)$

To take into account the multivalue bias, another factor is considered. This factor
is the information value of the attribute, and is given by

Information value $IV(A) = -\sum_i \frac{n_{i*}}{N} \log\frac{n_{i*}}{N}$

The necessary correction is done by using the gain-ratio measure as given by

Gain-ratio $GR(A) = IG(A) / IV(A)$

By computing the information gain for each attribute, the attribute producing the
greatest information gain can be found.
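
The measures above can be computed directly from a contingency table. The following is a minimal sketch in which rows are attribute values and columns are classes; base-2 logarithms are an assumption, since the text does not fix the base, and the function names are hypothetical.

```python
import math

def info(counts):
    """-sum (n/total) log2(n/total) over the non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def gain_measures(table):
    """table[i][j] = number of examples with attribute value v_i and class c_j.

    Returns (IG(A), IV(A), GR(A)).
    """
    row_sums = [sum(row) for row in table]         # row sums, one per value
    col_sums = [sum(col) for col in zip(*table)]   # column sums, one per class
    total = sum(row_sums)                          # N
    m_c = info(col_sums)                           # M(C)
    m_c_given_a = sum(r / total * info(row)        # M(C|A), weighted average
                      for r, row in zip(row_sums, table) if r > 0)
    ig = m_c - m_c_given_a
    iv = info(row_sums)
    gr = ig / iv if iv > 0 else 0.0                # guard: single-valued attribute
    return ig, iv, gr

# An attribute whose two values separate two classes perfectly.
perfect = [[4, 0],
           [0, 4]]
# An attribute that carries no information about the class.
useless = [[2, 2],
           [2, 2]]
print(gain_measures(perfect))
print(gain_measures(useless))
```

The guard on `iv` covers the degenerate case where every example has the same attribute value, for which the gain ratio is otherwise undefined.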

Noisy data. Noise is inherently present in concrete, real-world
data used as input to inductive learning systems. The two types of noise that are
fairly common are classification noise and measurement noise. Classification noise consists
of the judgmental errors of the expert or tutor in preclassifying the examples. Measurement
noise consists of the errors in measuring the attributes of the examples.

A direct  result  of  a noise-affected  input  data  set  is  that  either  the  resulting  tree  will 
overfit  the  noise  and  become  very  complex,  or  the  attributes  will  become  inadequate  for 
the  learning  task  at  hand.  For  example,  in  the  first  case,  because  of  noise,  an  attribute 
may  be  selected  for  branching  at  a node  even  if  it  is  irrelevant  to  the  class  of  examples 
in  that  node.  In  the  second  case,  because  of  noise  two  examples  may  have  the  same 
attribute  values  but  belong  to  different  classes,  and  these  two  examples  will  end  up  in  the 
same  leaf  node  as  there  will  not  be  any  attribute  that  could  discriminate  the  two. 

The  ID3  procedure  is  modified  to  handle  the  above  two  situations  in  the  following 
manner.  To  prevent  spurious  growth,  a chi-square  test  of  stochastic  independence  is 
performed  at  each  node  on  the  attribute  selected  for  branching  to  determine  whether  the 
class  of  the  examples  in  the  node  is  independent  of  that  attribute.  The  procedure  prevents 
branching  on  attributes  found  to  be  independent  of  the  class,  thereby  preventing  the 
spurious  growth  of  the  tree.  In  the  case  of  inadequate  attributes,  the  leaf  node  having  a 
set  of  examples  belonging  to  more  than  one  class  is  assigned  the  more  numerous  class. 
This  method  is  claimed  (Quinlan,  1986)  to  be  superior  to  other  methods  for  minimizing 
expected  error  because  it  minimizes  the  sum  of  absolute  errors  over  all  the  objects  in  that 
node,  instead  of  the  sum  of  squared  error. 
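
The independence test used to prevent spurious growth can be sketched as follows. This illustration computes the usual Pearson chi-square statistic on the contingency table of the selected attribute. The 5% significance level, the hard-coded critical value for one degree of freedom, and the function names are assumptions of this sketch, not details given in the text.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for table[i][j]."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, dof

# Critical value of chi-square at the 5% level for 1 degree of freedom,
# i.e. valid only for a two-value attribute and two classes.
CRITICAL_5PCT_DF1 = 3.841

def should_branch(table, critical=CRITICAL_5PCT_DF1):
    """Branch only if class and attribute do not appear independent."""
    statistic, _ = chi_square(table)
    return statistic > critical

print(should_branch([[10, 0], [0, 10]]))   # attribute strongly related to class
print(should_branch([[5, 5], [5, 5]]))     # attribute unrelated to class
```

For larger tables the critical value must be looked up for the corresponding degrees of freedom, which `chi_square` returns alongside the statistic.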

Pruning. In spite of the measures to prevent spurious growth described in the above
paragraph, decision trees can still become fairly complex and require further
simplification. One way of accomplishing simplification is by retrospective pruning.
Pruning makes the tree more comprehensible, and the error, while always greater for the
training set, is usually less for unseen cases. The difference in error will depend on the
degree of pruning. In the event that error improves with pruning, further pruning beyond
some point may cause the error to increase again. The point at which this turn-around
occurs will depend on the nature of the problem. This phenomenon can be explained by
the fact that the tree is initially overspecialized and restrictive, and thus causes type II
or false-on-positive errors on the unseen cases. As pruning is increased, the tree becomes
more general and moves closer to the true concept, thus reducing the error on unseen
cases. Beyond a certain point, the tree becomes overgeneralized and causes type I or
true-on-negative errors.

Several  pruning  methods  have  been  described  in  the  literature.  Some  of  the  more 
prominent  methods  are  cost-complexity  pruning  (Breiman  et  al.,  1984),  reduced  error 
pruning,  and  pessimistic  pruning  (Quinlan,  1987).  The  ID3  uses  pessimistic  pruning 
because  it  is  much  faster  and  does  not  require  a separate  test  set,  besides  being 
comparable  in  simplicity  and  accuracy.  The  pessimistic  pruning  method  proceeds  as 
follows:  For  each  non-leaf  subtree,  its  error  rate  is  compared  with  the  error  rate  after 
replacing  the  subtree  with  the  best  leaf  (i.e.  the  leaf  with  the  least  error  rate).  If  the  latter 
error  rate  is  within  one  standard  error  of  the  subtree  error  rate  then  the  leaf  permanently 
replaces  the  subtree.  Because  the  error  rates  based  on  the  training  set  are  not  reliable 
estimates for the unseen cases, the error is corrected pessimistically. This correction is
based on Yates' correction (Quinlan, 1987, 1990).

The pessimistic pruning strategy can be formally stated as follows: For any
$n$ training examples at a leaf, if the number of examples not belonging to the designated
class of that leaf is $e$, then the estimated error rate for this leaf is $e/n$. With Yates'
correction the estimated error rate would be $(e + 0.5)/n$. If a subtree $S$ has an estimated
error rate $E_S$, and if after replacing $S$ with the best leaf $L$ the error rate becomes $E_L$, then
$S$ will be permanently replaced by $L$ if $E_L$ lies within one standard error of $E_S$.
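
The pruning decision can be sketched as follows. This illustration works with corrected error counts rather than rates, adding 0.5 per leaf as described above. The standard error formula used here is the usual binomial one and is an assumption of this sketch, since the text does not state it; the function names are hypothetical.

```python
import math

def corrected_errors(leaf_errors):
    """Sum of (e + 0.5) over the given leaves (continuity correction)."""
    return sum(e + 0.5 for e in leaf_errors)

def should_prune(subtree_leaf_errors, best_leaf_errors, n_examples):
    """Replace the subtree by its best leaf if the leaf's corrected error
    count is within one standard error of the subtree's corrected count."""
    e_subtree = corrected_errors(subtree_leaf_errors)
    e_leaf = best_leaf_errors + 0.5
    # Assumed binomial standard error of the subtree's corrected error count.
    std_error = math.sqrt(e_subtree * (n_examples - e_subtree) / n_examples)
    return e_leaf <= e_subtree + std_error

# A subtree with four leaves covering 20 training examples, whose leaves
# misclassify 1, 0, 1 and 0 examples respectively.
print(should_prune([1, 0, 1, 0], best_leaf_errors=5, n_examples=20))
print(should_prune([1, 0, 1, 0], best_leaf_errors=8, n_examples=20))
```

Because each leaf contributes its own 0.5, a subtree with many leaves is penalized more heavily than the single leaf that would replace it, which is what makes the estimate pessimistic.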

Transforming  to  production  rules.  Yet  another  method  for  simplifying  decision 
trees  is  to  convert  the  tree  into  an  equivalent  set  of  production  rules  (Quinlan,  1993).  This 
method  proceeds  as  follows:  Every  leaf  of  the  decision  tree  is  converted  to  a production 
rule  which  corresponds  to  the  path  from  the  root  to  that  leaf.  Then,  the  rules  are 
generalized  by  dropping  irrelevant  conditions.  The  relevance  of  a condition  is  determined 
by Fisher's exact test of statistical significance. Next, the rules are sifted by removing
one rule at a time and classifying the training set with the rest of the rules. If the
omission does not increase the classification error, the rule is permanently dropped.
Finally, certainty factors (CF) are assigned to each rule based on the examples satisfying
the left-hand-side (LHS) of the rule and the examples that satisfy the LHS as well as
belong to the class indicated by the right-hand-side (RHS) of the rule, i.e. if $N$ examples
satisfy the LHS of the rule, of which $M$ belong to the class $C$ indicated by the RHS, then
the CF is given by $(M - 0.5)/N$.
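
The certainty factor computation can be sketched as follows. The rule representation (a dictionary of attribute-value conditions for the LHS and a class label for the RHS), the function name, and the toy data are assumptions made for illustration.

```python
def certainty_factor(examples, conditions, target_class):
    """CF = (M - 0.5) / N for the rule `IF conditions THEN target_class`."""
    covered = [label for inst, label in examples
               if all(inst.get(a) == v for a, v in conditions.items())]
    n = len(covered)                   # N: examples satisfying the LHS
    if n == 0:
        return None                    # the rule covers no training example
    m = sum(1 for label in covered if label == target_class)   # M
    return (m - 0.5) / n

examples = [
    ({"size": "large"}, "high"),
    ({"size": "large"}, "high"),
    ({"size": "large"}, "high"),
    ({"size": "large"}, "low"),
    ({"size": "small"}, "low"),
    ({"size": "small"}, "low"),
]
# Rule: IF size = large THEN high. N = 4, M = 3.
print(certainty_factor(examples, {"size": "large"}, "high"))
```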

Real-valued  attributes.  Continuous  valued  attributes  cannot  be  tested  for  the 
measure  of  split  unless  the  range  is  divided  into  a number  of  subranges.  Mingers  (1986) 
pointed  out  the  combinatorial  nature  of  dividing  into  several  subranges  and  testing,  and 
suggested  that  the  only  way  forward  is  to  restrict  the  subranges  to  two.  Another  method 
that Mingers suggested was to get the expert to specify the appropriate ranges and
thereby effectively make the attribute a nominal one. The ID3 algorithm has been
modified from its original version to use a single threshold value to divide a continuous
valued attribute into a dichotomy, i.e. values above and below the threshold. Thus, the
example set can be partitioned accordingly.
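
The choice of a single threshold can be sketched as follows. This illustration tries the midpoint between each pair of adjacent distinct values and keeps the one with the greatest information gain. The use of midpoints as candidates, the base-2 logarithm, and the function names are assumptions of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Information (in bits) needed to identify the class of one example."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    base = entropy(labels)
    best = (None, 0.0)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2      # candidate: midpoint (assumption)
        below = [c for v, c in pairs if v <= threshold]
        above = [c for v, c in pairs if v > threshold]
        remainder = (len(below) * entropy(below)
                     + len(above) * entropy(above)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best

sizes = [1, 2, 3, 10, 11, 12]
sales = ["low", "low", "low", "high", "high", "high"]
threshold, gain = best_threshold(sizes, sales)
print(threshold, gain)
```

If no candidate yields a positive gain, the sketch returns `(None, 0.0)`, signalling that the attribute offers no useful dichotomy at this node.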

When  classifying  an  unseen  example  that  has  a value  for  a real  attribute  close  to 
the  threshold,  a knife-edge  decision  to  follow  one  branch  or  the  other  may  be 
inappropriate. A better approach would be to follow both branches. ID3 has been
modified so that, instead of assigning an unseen case to a single class, the algorithm uses
the relative probability of the branches in the tree and the composition of the subsets of
training cases at the leaves to arrive at a distribution over one or more of the classes
(Quinlan, 1990).

Probabilistic classification. When the stopping criteria or pruning are applied, some
of the resulting leaves will have training examples of both classes. Usually, the class
assigned to the leaf is the more numerous class. A better approach is to assign a
probability to this class based on the proportion of cases belonging to that class and the
total number of cases in that leaf (Quinlan, 1990). This will result in a probabilistic
classification.




Given n and e as the number of training examples at a leaf L and the number of
those examples that do not belong to the class C of L, the probability that any test
instance that ends up in L will belong to C is (n-e)/n. With Yates' correction this
probability would be (n-e-0.5)/n.
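
As a worked illustration with made-up counts (not taken from any data set in this study), suppose a leaf holds n = 10 training examples, of which e = 2 fall outside the leaf's class C. Reading (n-e)/n as the share of the leaf's training cases that belong to C, the uncorrected and corrected values are:

```latex
\frac{n - e}{n} = \frac{10 - 2}{10} = 0.80,
\qquad
\frac{n - e - 0.5}{n} = \frac{10 - 2 - 0.5}{10} = 0.75
```

The correction matters most at small leaves: with n = 2 and e = 0 the uncorrected value is 1.0, while the corrected value is 0.75.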

Windowing. ID3 uses a windowing technique in which it initially considers only
a subset of the training data to construct a tree. This tree is then used to classify the
examples outside the window. If all these examples are correctly classified, the algorithm
stops; otherwise, some of the incorrectly classified examples are added to the window and
the tree building process is repeated. This iterative technique reportedly forms a correct
decision tree much more quickly than using the entire training set (Quinlan, 1986).
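
The windowing loop can be sketched as below. The tree learner and the classifier are supplied by the caller; the parameter names, the random choice of the initial window, and the cap on how many misclassified examples are added per iteration are assumptions made for the sketch, since the text does not specify them.

```python
import random


def windowed_induction(examples, build_tree, classify, window_size, max_added, seed=0):
    """Grow a tree from a window of the training data, enlarging it as needed.

    examples   -- list of (instance, label) pairs
    build_tree -- callable: list of examples -> tree (caller-supplied learner)
    classify   -- callable: (tree, instance) -> predicted label
    Returns (tree, final_window_size).
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    window = set(indices[:window_size])  # initial subset of the training data
    while True:
        tree = build_tree([examples[i] for i in sorted(window)])
        # Examples outside the window that the current tree gets wrong.
        wrong = [i for i in range(len(examples))
                 if i not in window
                 and classify(tree, examples[i][0]) != examples[i][1]]
        if not wrong:
            return tree, len(window)
        # Add some of the misclassified examples and rebuild.
        window.update(wrong[:max_added])
```

Every iteration that does not stop adds at least one example to the window, so the loop ends at the latest when the window has grown to the full training set.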

Null nodes. A "null" node may be created when some value of a branching
attribute has no corresponding examples. During classification, the system would
therefore fail to classify any object arriving at this node. ID3 avoids this problem by
assigning to a null node the predominant class of its parent node.

Missing values. Sometimes examples have attributes whose values are
missing. This creates a problem during the construction of the tree. ID3 has been
modified to distribute the examples with unknown values according to the proportion of
occurrences of each value in the known examples.
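
One way to realize this proportional distribution is sketched below. It interprets the distribution as fractional weights, so that an example with an unknown value is sent down every branch with a weight equal to that branch's share of the known examples. This interpretation, the use of `None` to mark a missing value, and the function name are assumptions made for illustration.

```python
from collections import Counter


def distribute_missing(examples, attribute):
    """Partition examples on `attribute`, spreading unknowns proportionally.

    examples -- list of dicts; a missing value is represented by None
    Returns {value: [(example, weight), ...]} with weight 1.0 for known values.
    """
    known = [ex for ex in examples if ex.get(attribute) is not None]
    unknown = [ex for ex in examples if ex.get(attribute) is None]
    counts = Counter(ex[attribute] for ex in known)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no example has a known value for this attribute")
    branches = {value: [(ex, 1.0) for ex in known if ex[attribute] == value]
                for value in counts}
    for ex in unknown:
        for value, count in counts.items():
            # Fractional membership proportional to the known occurrences.
            branches[value].append((ex, count / total))
    return branches
```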

2.3.4  Performance  Evaluation 

The performance of a tree building algorithm can be evaluated based on the size
of the trees it constructs and the predictive accuracy of the concepts it develops. The
simplicity can be measured by the number of leaves or nodes in the tree. The predictive
accuracy can be evaluated by an estimate of the true error rate. Though comprehensibility
is also a dimension along which trees could be evaluated, quantifying or measuring it is
very difficult.

Mingers (1989a; 1989b) reported results of a number of experiments using
measures of goodness-of-split on data sets from different domains. The results
show very strong size differences across domains, and that noise leads to very
large, bushy trees. There were also significant differences in size across measures, with
gain-ratio producing the smallest trees and chi-square producing the largest. However,
using a selection criterion reduced the size of the tree considerably in comparison to a
random selection strategy.

The  error  rates  did  not  show  any  significant  differences  across  measures  of  split. 
Pruning  the  tree  improved  the  error  rate  universally  regardless  of  the  measure  of  split. 
Mingers  concluded  that  the  accuracy  depends  almost  entirely  on  the  amount  of  noise  and 
the  degree  of  pruning,  and  not  on  the  measure  of  split. 

2.3.5  Summary

Numerous inductive approaches are being developed. Among these, ID3 stands out
as one of the best. ID3 is a symbolic, non-incremental, general purpose, supervised
inductive learning system that uses decision trees as its knowledge representation
formalism. Over the years, ID3 has undergone several revisions in which improvements
have been made to address issues raised in the literature over characteristic problems
inherent in real data. Some of these problems are noisy data, real-valued attributes,
inadequate discriminating power of attributes or irrelevant attributes, missing values, and
bias towards multi-valued attributes. Some of the remedial measures that have been
incorporated into ID3 are stopping criteria based on statistical significance,
retrospective pruning, different selection criteria at decision nodes, and softening of
thresholds.


2.4  Regression  Trees 


2.4.1  Introduction

Most inductive learning methods have been developed for classification, in the
sense that they require a discrete variable as the dependent variable. Once the
classification rules are learned, these classifiers can classify any unseen observation.
When the class variable is continuous, it has to be discretized a priori in order to use
these methods. The discretization can be vague and arbitrary, and thus can lead
to awkward and unsatisfactory results.

Regression trees (Breiman et al., 1984) are an alternative formalism to
classification trees that can predict an ordered response instead of a class label.
Regression trees, like classification trees, represent knowledge as decision trees. All the
internal nodes of the tree are decision nodes, and each leaf node corresponds to a distinct
value of the response variable rather than to some class. Thus, the value of the response of
an unseen observation is predicted by allowing the observation to pass through the tree
and settle on a leaf node. The value of that leaf node is the predicted value of the
response.

2.4.2  Regression  Tree  Building  Procedure

The regression tree is built by recursively partitioning the set of examples based
on some splitting criterion, until no further improvement can be achieved by splitting, or
until the number of observations in a given node becomes less than some pre-specified
value. The partitioning is binary. First, the tree is grown to its fullest extent using one set
of examples; then, using a separate set of examples, the tree is pruned back to different
depths and a series of trees is created. The tree in this series with the lowest error
is then selected.

At a given node, the reduction in mean squared error (MSE) is used as the splitting
criterion. During learning, MSE is estimated by the resubstitution method. The
resubstitution error is given in terms of the sum of squared deviations from the mean of
the training examples in a particular node.

Let an example X be described as (x, y), where instance x falls in a measurement
space $\mathcal{X}$ and y is a real-valued number. For any node t the resubstitution error
R(t) is given as

$$R(t) = \frac{1}{n} \sum_{x_i \in t} \left( y_i - \bar{y}(t) \right)^2$$

where $\bar{y}(t)$ is the mean value of $y_i \in t$, and n is the number of observations.




The attribute-value pair that causes the best split by reducing the error most is
selected as the splitting rule. Finding the best split is a two-step process. In the first
step, the best split for each attribute is found. Then, the attribute that produced the best
split is selected.

In order to determine the best split for each attribute, a set of candidate splits is
created. The set of candidate splits for a nominal attribute is the set of its values. The two
partitions created by a split would contain the observations that have the particular value
of that attribute and the ones that do not. In the case of an ordered attribute, the
candidates could either be all values of that attribute that occurred in the learning set or
some finite set of values at fixed intervals. Here, the two partitions created would be those
observations that have an attribute value greater than the candidate and those whose value
is less than or equal to it.

For unordered or nominal attributes $a_i$, the candidate set consists of $a_i = v_{ij}$ for all
values $v_{ij}$ of $a_i$ in the training set. Then, the attribute that gives the best split is
selected. For ordered attributes $a_i$, the candidate set consists of all partitions based on
$a_i \le v_{ij}$, for all values $v_{ij}$ of $a_i$ in the training set.

For any node, given a set of examples $E = \{X\}$, a partitioning of this set into $E_L$ and
$E_R$ would be such that,

(i)  For a nominal attribute:

$$E_L = \{ X \mid a_i = v_{ij} \} \quad \text{and} \quad E_R = \{ X \mid a_i \ne v_{ij} \}$$

(ii)  For an ordered attribute:

$$E_L = \{ X \mid a_i \le v_{ij} \} \quad \text{and} \quad E_R = \{ X \mid a_i > v_{ij} \}$$

For any split s that partitions node t into $t_L$ and $t_R$, the change in MSE is

$$\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$$

where $R(t_L)$ and $R(t_R)$ are the respective error estimates for $t_L$ and $t_R$.

Then, for a given set of candidate splits S in t, the best split s* is such that

$$\Delta R(s^*,t) = \max_{s \in S} \Delta R(s,t)$$
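
The split search described above can be sketched as follows. The sketch assumes that the n in R(t) is the size of the whole learning set, as in Breiman et al. (1984), so that node errors can be subtracted and summed directly; the text itself says only that n is "the number of observations". The tuple encoding of a split and all function names are hypothetical.

```python
def node_error(responses, n_total):
    """R(t): sum of squared deviations from the node mean, divided by n_total."""
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((y - mean) ** 2 for y in responses) / n_total


def candidate_splits(examples, attribute, ordered):
    """Candidate splits for one attribute, taken from the values seen at the node."""
    values = sorted({x[attribute] for x, _ in examples})
    if ordered:
        # a_i <= v; the largest value is skipped because it leaves the right side empty.
        return [(attribute, "<=", v) for v in values[:-1]]
    return [(attribute, "==", v) for v in values]  # a_i = v versus a_i != v


def goes_left(x, split):
    attribute, op, value = split
    return x[attribute] <= value if op == "<=" else x[attribute] == value


def best_split(examples, attribute_kinds, n_total):
    """Return (split, delta_R) maximizing R(t) - R(t_L) - R(t_R).

    examples        -- list of (attributes_dict, response) pairs at node t
    attribute_kinds -- dict: attribute name -> True if ordered, False if nominal
    """
    parent = node_error([y for _, y in examples], n_total)
    best, best_delta = None, float("-inf")
    for attribute, ordered in attribute_kinds.items():
        for split in candidate_splits(examples, attribute, ordered):
            left = [y for x, y in examples if goes_left(x, split)]
            right = [y for x, y in examples if not goes_left(x, split)]
            if not left or not right:
                continue  # a one-sided split does not partition the node
            delta = parent - node_error(left, n_total) - node_error(right, n_total)
            if delta > best_delta:
                best, best_delta = split, delta
    return best, best_delta
```

A node would then be declared terminal when the returned `delta_R` falls below the chosen threshold or when `best` is `None`.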

Two stopping criteria are used to terminate the growth of the tree. First, whenever
the change in the resubstitution estimate for MSE of the best split at a node is less than
a pre-specified threshold, the node is declared terminal. Second, if the number of
observations in a node is below a pre-specified threshold, then the node is declared
terminal.




Each terminal node of the tree contains the mean value and the standard deviation
of the dependent variable for all the observations belonging to that node. The mean is the
predicted response of that leaf and is constant for that node. The error estimate for the
tree, R(T), is the sum of the error estimates of the leaf nodes.

Pruning is performed using a separate test set. Instead of the resubstitution error
estimate, which was used for growing the tree, the test sample estimate is used for pruning.
First, the test set is passed through the tree and the test sample estimate is computed for
each node. For any node t, the test sample estimate R'(t) is given as

$$R'(t) = \frac{1}{n} \sum_{x_i \in t} \left( y_i - \hat{y}(x_i) \right)^2$$

where $\hat{y}(x_i)$ is the response value of $x_i$ predicted by t, and n is the number of
observations. The test sample error of the tree, R'(T), is the sum of the test sample
estimates of the leaf nodes.

Second,  a candidate  pruning  set  of  sibling  leaf  pairs  is  created.  Out  of  the 
candidate  set,  the  pair  whose  removal  causes  the  greatest  reduction  or  the  least  increase 
in  the  error  of  the  tree  is  selected  as  the  pair  to  be  pruned.  This  pair  is  then  replaced  by 
the  parent.  As  only  pairs  of  nodes  enter  the  candidate  set,  this  parent  node  will  enter  the 
candidate set only when its sibling becomes a leaf. The process continues until only the
root node remains. Finally, out of the series of trees, the one with the least test
sample error is selected.
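
The pruning procedure can be sketched as below. The `Node` class, the restriction to ordered splits, and the function names are assumptions made to keep the example short, and the tree error is computed with n taken as the size of the test set. When two trees in the series tie, this sketch keeps the larger one; the text does not say how ties are resolved.

```python
import copy


class Node:
    """Binary regression-tree node; a node with split=None is a leaf."""

    def __init__(self, mean, split=None, left=None, right=None):
        self.mean = mean      # mean training response of the examples at this node
        self.split = split    # (attribute, threshold): left when x[attribute] <= threshold
        self.left = left
        self.right = right


def predict(node, x):
    while node.split is not None:
        attribute, threshold = node.split
        node = node.left if x[attribute] <= threshold else node.right
    return node.mean


def tree_error(root, test_set):
    """R'(T): mean squared prediction error of the tree over the test set."""
    return sum((y - predict(root, x)) ** 2 for x, y in test_set) / len(test_set)


def parents_of_leaf_pairs(node):
    """Internal nodes whose two children are both leaves (the candidate set)."""
    if node.split is None:
        return []
    if node.left.split is None and node.right.split is None:
        return [node]
    return parents_of_leaf_pairs(node.left) + parents_of_leaf_pairs(node.right)


def collapse(node):
    """Replace a sibling leaf pair by its parent, which becomes a leaf."""
    node.split, node.left, node.right = None, None, None


def prune(root, test_set):
    """Return the tree with the least test sample error in the pruning series."""
    tree = copy.deepcopy(root)  # leave the caller's tree untouched
    series = [(tree_error(tree, test_set), copy.deepcopy(tree))]
    while tree.split is not None:
        best_parent, best_error = None, None
        for parent in parents_of_leaf_pairs(tree):
            saved = (parent.split, parent.left, parent.right)
            collapse(parent)                                  # try the prune
            error = tree_error(tree, test_set)
            parent.split, parent.left, parent.right = saved   # and undo it
            if best_error is None or error < best_error:
                best_parent, best_error = parent, error
        collapse(best_parent)
        series.append((best_error, copy.deepcopy(tree)))
    return min(series, key=lambda pair: pair[0])[1]
```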




2.4.3  Prediction  and  Error  Estimation 

To  predict  the  response  of  an  unseen  observation,  it  is  passed  through  the  pruned 
tree  until  it  reaches  a leaf  node.  The  predicted  value  is  the  mean  value  contained  in  that 
leaf.  The  predictive  performance  of  the  regression  tree  can  be  evaluated  by  measuring  test 
sample  error  of  the  tree,  R'(T),  on  a test  set  of  observations. 

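Because R'(T) depends on the scale in which the response is measured, a
normalized measure is used to eliminate scale dependence. This measure, called the
relative mean squared error, RE'(T), is defined as

$$RE'(T) = R'(T) / R(\bar{y})$$

where $R(\bar{y})$ is the mean squared error obtained by predicting every response with the
mean response $\bar{y}$.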
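
A minimal sketch of this normalization follows. It reads the denominator as the mean squared error of always predicting the mean response, and it computes that mean on the same test set as the numerator; both the reading and the function name are assumptions made for the example.

```python
def relative_error(actual, predicted):
    """RE'(T): test-sample MSE of the tree divided by the MSE of the mean predictor."""
    n = len(actual)
    tree_mse = sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n
    mean = sum(actual) / n
    baseline_mse = sum((y - mean) ** 2 for y in actual) / n
    if baseline_mse == 0:
        raise ValueError("responses are constant; relative error is undefined")
    return tree_mse / baseline_mse
```

A value near 0 indicates predictions much better than the constant mean predictor, and a value near or above 1 indicates that the tree adds little or nothing over it.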

2.5  Business  Applications  of  Tree  Induction 

While  one  branch  of  empirical  learning  research  is  engaged  in  an  attempt  to 
improve  the  learning  methodologies,  interest  in  their  application  to  real  problems  is  also 
beginning  to  emerge.  A few  studies  that  used  tree  induction  in  the  business  domain  have 
been reported in the literature. These studies have used inductive learning for prediction
and classification tasks.

Tam (1991) used decision tree induction to develop trading rules for stock
screening. The training set consisted of stocks traded on the NYSE over the
period 1980-1984. The financial attributes used were both accounting ratios and price
movement information. Price was the classificatory variable. The class of "high-growth
stock" was assigned to those stocks whose price doubled within a year. The study reported
that the portfolio selected using trading rules generated by the tree induction method
consistently outperformed the NYSE composite index across the years. Similar
performance is reported relative to the S&P 500 index. The author claims that though the
results are not conclusive, the induced decision trees can detect regularities contrary to
the beliefs espoused in the theory of market efficiency.

Braun  and  Chandler  (1987)  used  rule-induction  to  predict  the  market  calls 
of  the  stock  analyst  and  the  actual  movement  of  the  stock  market.  The  decision 
environment  is  very  dynamic.  The  traditional  stock  analysis  techniques  such  as  trend 
analysis, cycle analysis, charting techniques, etc., are unreliable. As a result, stock analysts
base their predictions on the results of a number of techniques. Statistical techniques such
as  discriminant  analysis,  regression  analysis,  logit  and  probit  regression  analysis  are  used 
to  construct  models  of  the  stock  market  behavior.  Rule  induction  is  an  alternative 
approach  to  modelling  the  stock  market  behavior  and  it  can  integrate  qualitative  and 
quantitative  measures,  unlike  its  statistical  counterparts.  It  can  also  recognize  the 
importance of the individual measures. The predictive accuracy of the experimental results
ranges from 57.5 to 65.0 percent. The results show that the induced rules predicted the
market  better  than  they  predicted  the  analyst’s  calls  and  also  better  than  the  analyst’s 
prediction  of  the  market.  It  is  shown  that  selecting  the  set  of  examples  that  cover  the 
spectrum  of  possible  situations  is  more  important  than  the  actual  number  of  examples. 

Inductive  algorithms  have  been  used  to  discover  predictive  knowledge  structures 
of  financial  data  in  a more  static  decision  environment  (Messier  and  Hansen,  1988).  The 
diagnostic and predictive validity of the inductive approach is evaluated on loan default
data and bankruptcy data. The performance of the inductive method is compared to
discriminant analysis models, individual judgements, and group judgements. The
performance measures used are (1) the number of attributes required for prediction, and
(2) the percentage of correct classifications. Predictive accuracies of 87.5 and 100 percent
were achieved for loan default and bankruptcy, respectively. The inductive models were found
to  have  performed  better  than  the  other  three  models.  While  the  predictive  accuracy  was 
higher,  fewer  attributes  were  required  for  the  inductive  models.  They  argue  that  the 
inductive  method  is  useful  in  producing  decision  rules  for  small,  highly  structured 
problem  domains,  but  difficult  to  apply  to  very  large  problem  domains.  The  method  is 
also  reported  to  be  weak  in  handling  missing  and  conflicting  data. 

In  an  empirical  study  using  inductive  learning  to  determine  the  risk  classification 
of  commercial  bank  loans,  Shaw  and  Gentry  (1988)  report  similar  results  on  the  predictive 
accuracy.  The  decision  rules  are  inferred  from  qualitative  and  quantitative  information 
about  the  firm.  The  prediction  task  is  to  effectively  discriminate  between  potential  failures 
and non-failures from data one year ahead. The predictive and classification accuracies
compare favorably with those resulting from logit models.

Another interesting financial application of ID3 is in investment appraisal (Race
and Thomas, 1988). The study explores the use of rule induction in interpreting financial
simulation results. Because traditional methods such as Monte Carlo simulation do not
directly identify the factors to which profitability is most sensitive, Race and Thomas use
rule induction to identify the major risks facing an investment. The data set they used had
1000 examples with five integer-valued attributes that were the simulated variables. Net
present value (NPV) is used as the classificatory variable by mapping its continuous
values into discrete logical classes. If the learning task is to learn rules that can
distinguish between profit and loss, then NPV can be divided into binary classes around
the value zero. Their initial experiments produced extremely large trees because of the
multiple classes that needed to be classified. By collapsing the outermost classes, they
were able to reduce the size of the tree, but not significantly enough to achieve any
reasonable level of comprehensibility. They found some irrelevant attributes appearing at
the top of the tree, and attributed this to noise or to an insufficient number of examples
in the example set. They did not report the predictive accuracy of the rules learned.

Currim  et  al.  have  used  ID3  in  consumer  choice  analysis  studies  to  model  the 
contingent  manner  in  which  consumers  process  product  attribute  information.  They  used 
two  years  of  UPC  scanner  panel  data  of  regular  ground  coffee  purchases  of  200 
households  in  a confined  geographic  area.  Some  of  the  attributes  that  were  recorded  were 
price,  brand  name,  feature  advertisement,  store  display,  etc.  All  the  data  were  represented 
as dichotomous variables. They applied the ID3 algorithm to each household's data and
constructed a household-level model for each. Their results indicate that in predicting
brand  choices  only  3%  of  the  derived  trees  yield  prediction  rates  of  less  than  90%.  They 
compared  these  results  with  logit  modelling,  and  found  them  to  be  comparable  in 
performance. 


CHAPTER  3 
DOMAIN  PROBLEM 

3.1  Overview  of  Catalog  Sales 

Selling  merchandize  via  direct  mail-order  catalogs  is  a multi-billion  dollar  industry 
experiencing  phenomenal  growth  (Oren,  1989;  McCorkle,  1990).  Home  shopping  via 
mail-order  catalogs  has  become  increasingly  popular  for  several  reasons.  First,  as  social 
and  economic  conditions  change,  society  is  turning  into  what  is  known  as  a "cash-rich, 
time-poor"  society.  As  a result,  consumer  shopping  orientations  change  from  shopping  at 
stores to home-shopping (Lockett and Holland, 1991; Stephenson, 1989). The growth rate of
non-store  retailing  is  almost  double  the  growth  rate  of  in-store  retailing,  and  the 
purchasers  tend  to  be  young,  educated,  and  affluent  (Peterson  et  al.,  1989).  To  meet  these 
changing  needs,  major  retailers  have  expanded  their  range  of  marketing  strategies  by 
offering  such  services  as  direct  marketing,  home-shopping  through  direct  mail-order,  etc. 
Second,  there  is  a segment  of  consumers  who  derive  little  or  no  pleasure  from  the  store 
shopping  process  (Shim  and  Mahoney,  1991).  Third,  selling  by  direct  mail-order  catalogs 
and  by  electronic  media  may  be  more  cost  effective  alternatives  to  selling  through 
traditional  retail  outlets  because  of  the  high  overheads  associated  with  the  latter  in  storing 
and  displaying  products  in  central  locations  where  space  is  limited  and  expensive. 

Catalog sales operations can have a variety of organizational structures. A common
type of structure is the functional orientation, organized as purchasing, advertising, and
merchandizing. The purchasing department, through a contingent of buyers spread across
the country, procures the merchandize. The advertising department creates the
merchandize catalog, the sole instrument that facilitates the sales by conveying the
product information to the consumer. The merchandizing department, using its knowledge
of competition and consumption, sets the price to maximize the sales, oversees the
delivery of goods to the consumer, and maintains control over the inventory.

Merchandize  catalogs  are  designed  to  communicate  information  about  each 
product, including its appearance, cost, range of colors and sizes, complementary
products and accessories, and other pertinent information. The
objective  of  the  design  is  to  influence  the  purchase  decision.  A marketing  communication 
affects  the  consumer’s  decision  making  process  by  providing  visual  stimuli,  drawing  their 
attention  to  product  features  and  benefits,  creating  an  interest  in  the  product,  and  finally, 
helping  in  the  subsequent  decision  to  purchase. 

The  content  of  a display  can  affect  the  consumer  only  when  it  has  attracted  the 
consumer's attention (Diamond, 1968). Thus, understanding the relationship between the
display  format  and  its  attention-getting  power  is  important  for  a successful  catalog  design. 
The  display  characteristics  together  with  product  features  determine  how  well  a product 
receives  attention  of  the  catalog  shopper.  The  catalog  page  layout  helps  focus  attention 
and  control  the  flow  of  attention.  The  design  provides  cues  that  assist  the  consumer  to 
perceive  product  benefits,  user  characteristics,  and  usage  situations  (Janiszewski,  1990). 




3.2  Problem  Analysis 

One of the most critical decisions that have to be made in the design process is how
best  to  display  the  products  in  the  catalog,  because  the  purchase  decision  of  the  consumer 
is,  to  some  extent,  determined  by  the  manner  in  which  the  products  are  displayed 
(Curhan,  1973).  This  entails  knowledge  of  what  display  features  significantly  influence 
the  attention  and  the  interest  of  the  consumer,  and  how  this  interest,  in  turn,  relates  to 
sales. This knowledge is helpful in reducing the uncertainty of the sales outcome, thus
enabling better decision making in both purchasing and merchandizing. Consequently,
apart from product quality, an effective catalog design will have an organization-wide
impact on performance as well as on the ultimate profitability of the catalog operation.

The  layout  of  a catalog  still  remains  a subjective  and  intuitive  process,  and 
practitioners  make  decisions  largely  on  the  basis  of  past  experience,  creative  insight,  and 
intuition  (Burke  et  al.,  1990).  The  layout  decisions,  made  routinely  by  the  designers  with 
some  guidance  from  the  art  director,  are  unstructured  and  error-prone.  There  are  no  cut- 
and-dried  methods  used  for  handling  the  layout  of  a page.  The  design  of  each  page  is 
novel  and  subject  to  the  judgement  and  creativity  of  the  designer.  The  outcome  is 
unpredictable,  especially  when  each  page  of  the  catalog  is  the  work  of  a different 
designer,  working  individually  and  independently,  and  a variety  of  creative  options  exist. 
Perhaps,  very  little  or  no  consideration  is  given  to  the  mechanism  by  which  a consumer 
assimilates  and  weighs  information  presented  on  a complete  page. 




Each  page  of  the  catalog  contains  more  than  one  product,  and  hence,  not  only 
should  the  design  take  into  consideration  the  display  of  each  product  and  the 
accompanying  information  but  also  the  competing  demands  for  space  utilization.  While 
a single  product  displayed  on  a page  will  have  a captive  audience,  and  hence  increase  the 
opportunity  to  sell  the  product,  the  opportunity  cost  of  not  having  a second  product  in  the 
page  might  far  offset  the  relative  gains.  Hence,  a single  product  per  page  may  not  be  the 
best design choice. A well designed multi-product display could help in achieving
higher sales per page, thus maximizing the benefit-cost ratio of the catalog. On the
other hand, a poorly designed page layout may increase the sales of one product at the
cost of one or more other products that appear on the page, and thus sub-optimize the
benefit-cost ratio.

Not every display gets the same attention; some are viewed more and others less.
While  the  product  characteristics,  by  themselves,  are  of  interest,  the  display 
characteristics,  such  as  position,  size,  presentation,  isolation,  etc.  can  add  value  to  the 
product  or  even  make  the  product  appear  totally  unattractive.  A product  having  attractive 
features  that  would  otherwise  sell  well,  may  get  overlooked  by  customers  because  of  its 
display characteristics, and end up performing poorly. Documented evidence has shown
that  the  display  can  make  a difference  in  the  sales.  Thus,  distinguishing  between  display 
features  that  encourage  and  discourage  attention  will  lead  to  better  designs. 

The important question, then, is what constitutes a good design of a catalog page.
While the outcome of a well-designed catalog page may be easily described, how to put
together a page that would effect such an outcome is a very difficult problem. Neither is
there a scientific methodology nor is there well-recognized expertise. Layouts are based
on the intuition and the creative insights of the layout artist and the art director. Whether
this creativity produces the desired outcome or is purely an aesthetic artifact is not known.
Stated differently, which characteristics of the page layout affect the sales outcome, and
in what ways, is not well understood.

3.3  Problem  Specification 

In  order  to  design  a catalog  that  could  potentially  achieve  the  ultimate  goal,  first, 
the  relationship  between  the  display  characteristics  and  the  sales  outcome  has  to  be 
understood.  This  will  help  in  developing  a set  of  guidelines  or  rules  for  the  artist  to  use 
in  the  design  process  and  allow  the  artist  to  be  more  objective  rather  than  being  purely 
subjective. The rules will be helpful in avoiding seemingly attractive yet poorly
selling product display configurations on a page.

Uncovering  the  relationship  between  the  display  characteristics  and  the  sales, 
though  useful,  does  not  cast  light  on  the  intermediate  processes,  that  is,  how  the  display 
draws  the  attention  of  the  consumer  and  influences  the  purchase  decision  or  the  sales. 
Thus,  understanding  the  relationship  between  display  characteristics  and  attention,  and 
furthermore,  attention  and  sales,  will  be  helpful  in  validating  the  previous  model,  as  well 
as  useful  in  fine-tuning  the  catalog  design  prior  to  roll  out. 

One method of measuring attention is to analyze viewing patterns of consumers
as they scan through the catalog. Several different viewing characteristics, such as the
order of scanning products, the duration of viewing a product, and the pupil size, can be
used as measures of attention (Janiszewski, 1989). These viewing characteristics can be
measured in a laboratory setting using eye-tracking equipment to unobtrusively monitor
the eye movements of a sample set of subjects as they scan through the catalog.

The difficulty of determining these relationships is compounded by the fact that the
product  characteristics  are,  by  themselves,  appealing,  and  could  influence  the  sales 
outcome.  Isolating  this  effect  will  require  laboratory  manipulation  of  the  display  variables 
for  the  same  set  of  products  across  each  page  and  measuring  some  surrogate  for  the  sales 
outcome  (Janiszewski,  1989). 

Two  models  are  of  interest,  the  understanding  of  which  will  be  beneficial  to  the 
development  of  the  theory  as  well  as  to  the  design  of  the  catalog.  They  are  the  following: 


Model 1:  DISPLAY ATTRIBUTES  -->  SALES

Model 2:  DISPLAY ATTRIBUTES  -->  ATTENTION  -->  SALES


The  first  model  represents  a direct  relationship  between  the  display  features  and 
sales. Though this may be of value to a practitioner in deciding how to design a product
display to enhance sales, it does not capture the intellectual underpinnings of the process
by  which  a consumer  is  influenced  to  purchase. 

The second model is a more comprehensive model. In essence, it is a
decomposition of the first model into its basic component relationships. The first segment
of this model attempts to capture the essence of the problem of engaging the attention of
the consumer. The consumer has to be attracted to the product before any purchase
behavior can take place. The second segment of the model tries to capture what level of
attention will finally lead to a purchase. Modelling these relationships independently will
allow laboratory manipulation of the display format to measure its effect on
attention and, in turn, to predict the resulting sales outcome (Janiszewski, 1990).

3.4  Benefits 

Understanding  the  relationships  will  be  beneficial  to  both  the  researcher  and  the 
practitioner.  The  researcher  can  use  the  knowledge  to  confirm  or  refute  existing  domain 
theories, or to develop new theories. The practitioner can use the knowledge in the design
process to produce better designs.

Knowing  how  attention  influences  sales  is  helpful  in  laboratory  testing  of  design 
alternatives  and  predicting  sales  prior  to  the  catalog  roll  out.  Alternatively,  knowing  how 
the  display  attributes  affect  attention  is  useful  in  developing  designs  that  will  draw  the 
right  amount  of  interest  for  a required  level  of  sales. 

The  benefits  will  be  extended  to  other  related  areas  such  as  inventory  planning  and 
control  and  price  setting.  Reduction  in  the  uncertainty  of  the  sales  outcome  based  on  the
display  enables  better  sales  forecasts,  and  in  turn,  better  inventory  control.  A better  design 
will  naturally  augment  sales,  and  in  turn,  allow  higher  prices. 

Formalized  and  codified  knowledge  appropriate  for  the  design  function  will  exist 
and can be used as an expert system to facilitate decision making. A consultation system can
make  recommendations  and  suggestions  on  design  alternatives,  potential  flaws,  etc. 


CHAPTER  4 

EXPERIMENTAL  METHODOLOGY,  RESULTS,  AND  ANALYSIS  - PART  1 

4.1 Research Goals

The  research  that  we  have  conducted  is  mainly  of  an  exploratory  nature  having 
several  broad  research  goals: 

- Demonstrate the relevance of machine learning techniques in solving concrete
problems. Machine learning is a relatively new technique compared to classical statistical
techniques for classification and prediction tasks, and for discovering unknown
relationships. Its potential is still not fully explored and its merits and demerits are not fully
understood. Much exploratory work is required before it can be accepted as a viable
alternative to statistical techniques in business domains such as marketing.

- Investigate the concept learning behavior based on rule induction using decision
trees.  Decision  tree  induction  has  had  a more  promising  outlook  than  other  competing 
learning  methods.  Algorithms  for  constructing  decision  trees  have  been  continually 
improved  in  order  to  meet  the  challenges  of  learning  concepts  from  actual  data  with  all 
their  inherent  defects  and  deficiencies.  The  effectiveness  of  these  improvements  for 
learning  true  concepts  in  specific  domains  is  uncertain  and  has  to  be  better  understood. 

- Investigate  insightful  relationships  in  the  domain  that  would  contribute  towards 
supporting  or  refuting  existing  domain  theories  or  lead  to  new  theories.  Much  has  been 
studied  and  written  about  print  advertisements  and  shelf  displays  in  supermarkets.  Few
relationships,  such  as  the  influence  of  the  size  and  position  on  sales,  have  been 
established.  Very  little  has  been  studied  about  catalog  sales,  which  is  a multimillion  dollar 
industry.  The  attention-getting  properties  of  product  display  and  the  mechanism  by  which 
display  features,  either  directly  or  indirectly,  influence  sales,  are  of  equal  interest  to  both 
academicians  and  practitioners. 

- Develop and validate a knowledge base that would capture the essence of the
domain  problem  and  the  structure  implicit  in  the  data.  Finally,  in  the  absence  of  clearly 
defined  expertise  or  a well-developed  domain  theory,  machine  learning  offers  a viable 
alternative  to  non-automated  knowledge  engineering  techniques  for  discovering  the 
expertise.  Developing  a knowledge  base  that  is  useful,  comprehensible,  and  reliable  is  one 
of  the  greatest  payoffs  to  the  practitioner. 

4.2  Example  Databases 

The  data  for  the  experiments  were  obtained  from  an  ongoing  research  project  in 
the  Marketing  Department  at  the  University  of  Florida.  Two  types  of  data  have  been 
collected  and  recorded.  The  first  type  refers  to  the  display  attributes  of  products  presented 
in  sales  catalogs,  and  the  second  type  refers  to  the  attention  that  the  products  receive  from 
potential  customers. 

Data  on  display  attributes  are  obtained  directly  from  the  pages  of  sales  catalogs 
of a general merchandise retailer. These relate to the display characteristics of products
in  each  page  of  the  catalog,  such  as  the  size  of  the  product  or  the  picture,  the  number  of 
other  competing  products,  or  the  supporting  accessories  (display  attributes  are  described
in  Appendix  A).  Display  attributes  are  of  mixed  type,  i.e.  some  attributes  are  ordered 
(real-  and  integer-valued)  and  others  are  unordered  (nominal-valued).  For  each  product, 
its  relative  sales  is  recorded  together  with  the  display  attributes.  Relative  sales  is 
expressed  as  the  ratio  of  actual  sales  to  the  estimated  sales  of  a product.  Since  actual  sales 
are  not  influenced  by  display  attributes  alone  but  by  several  other  factors,  such  as  the 
economy,  demographics,  trends  in  fashion,  etc.,  we  use  the  relative  sales  instead.  The 
reasoning  behind  this  is  that  the  sales  forecasts  made  by  the  experts  have  already  taken 
into  account  the  influence  of  these  factors,  and  that  any  variability  of  the  actual  sales  from 
the  forecast  sales  could  very  likely  be  due  to  the  presentation  of  the  products  in  the 
catalog.  The  sales  figures  are  obtained  from  the  sales  records  of  products. 
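
As a minimal sketch of the relative sales measure described above, the ratio can be computed as follows. The function name and the sample figures are hypothetical and are not taken from the catalog data.

```python
def relative_sales(actual_sales, forecast_sales):
    """Ratio of a product's actual sales to the expert's forecast for it."""
    if forecast_sales <= 0:
        raise ValueError("forecast_sales must be positive")
    return actual_sales / forecast_sales

# Hypothetical product that sold 450 units against a forecast of 600 units.
print(relative_sales(450.0, 600.0))  # 0.75
```

A value below 1 indicates that the product sold less than forecast, and a value above 1 that it sold more.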

Data  on  attention  are  obtained  from  laboratory  experiments.  Several  surrogate 
measures  for  attention  are  used.  These  measures  refer  to  the  number  of  consumers  that 
looked  at  that  particular  product  as  a percentage  of  all  the  consumers  that  looked  at  the 
catalog,  how  early  in  the  order  a product  is  looked  at  from  among  the  products  in  the 
page,  or  how  frequently  the  gaze  is  returned  to  a product  (measures  of  attention  are  given 
in  Appendix  B).  The  laboratory  has  been  set  up  with  eye-tracking  equipment  that  can 
monitor  the  eye  movement  as  the  eyes  scan  the  product  information  on  a page.  The 
equipment  provides  a measure  of  the  fixations  of  the  eye  with  respect  to  locations  on  a 
page  at  discrete  points  in  time.  The  experiments  have  been  conducted  using  subjects  who 
look  through  the  catalog  while  their  eye  movement  is  being  monitored.  Approximately 
forty  subjects  have  been  used  for  each  catalog.  The  measurements  are  made  for  several
subjects  as  they  look  through  the  catalog.  These  measurements  are  then  averaged  and 
mean  values  are  obtained  for  each  product. 

Two databases were available from the 1990 and 1991 sales catalogs of a leading
general merchandise retailer. The 1990 data had 234 observations from 108 pages and the
1991 data had 371 observations from 132 pages. Each observation referred to a single
product.

The  databases  contained  four  different  values  for  placement  of  a product.  Each 
value  corresponded  to  a different  scheme  that  was  used  to  define  the  positions  on  the 
page. Scheme three (Figure 4.1) was selected because it was the recommended scheme
in the largest number of cases. Product group is a composite attribute having two values.
Group  1 consists  of  displays  of  pants  or  skirts  having  one  other  product  in  the  picture. 
The  rest  belong  to  Group  2. 

Figure 4.1: Placement Scheme Three. Each page (left page and right page) is divided
into three areas: Left, Center, and Right.

For  each  of  the  three  models  that  we  investigated,  namely,  display-attention, 
attention-sales,  and  display-sales,  we  created  data  sets  from  these  databases.  The  set  of
attributes selected for each data set was based on the judgement of the domain expert,
who, in this case, is a professor of marketing. Fourteen display attributes were selected as
being  ones  likely  to  produce  some  salient  contribution  to  the  variability  of  attention  and 
sales.  Similarly,  six  attention  attributes  were  selected. 

Structurally,  the  data  sets  appear  to  be  fairly  complex.  Characteristics  of  the  data 
such  as  the  high  dimensionality  of  its  instance  space,  the  mixed  nature  of  the  data  types 
(attention  variables  are  real-valued,  while  the  display  attributes  are  mixed,  i.e.  nominal, 
interval,  and  continuous),  and  the  apparent  non-homogeneity  (i.e.  the  relationships  among 
attributes  vary  along  the  instance  space),  contribute  to  this  complexity. 

The  first  data  set,  display-attention,  consists  of  examples  of  products  having 
fourteen  display  attributes  as  the  independent  variables  and  attention  as  the  dependent 
variable.  Attention  is  taken  to  be  the  percentage  of  consumers  that  looked  at  a product. 
The  related  literature  has  suggested  that  the  attention  a product  receives  is  driven  by  the 
manner  in  which  it  is  displayed.  However,  in  the  catalog  sales  domain,  there  has  been  no 
conclusive  evidence  in  support  of  this  relationship. 

The second data set, display-sales, is similar to the first in that each
observation carries the same set of display attributes, but the dependent variable is
relative sales instead of attention. The literature on print advertisements and supermarket
shelf display has suggested a relationship between display and sales; however, these findings
have not been extended to the catalog sales domain.

The  third  data  set,  attention-sales,  consists  of  observations  for  each  product,  with 
six  different  measures  of  attention  as  the  independent  variables  and  relative  sales  as  the
dependent variable. Again, relative sales is used as the class. Here too, the
literature on consumer behavior and psychology has suggested that the decision making
of the consumer is influenced by the interest generated in the product, and that interest
can be measured by the attention given to a product. These relationships, too, have not
been firmly established.

The classification variables, sales and attention, being real-valued, had to be
discretized into "high" and "low" classes, "high" being values above a pre-specified
threshold and "low" being values below it. The threshold was based on the type
of concept or the discriminating characteristic we wished to learn. For example, if the
concept of the bottom quartile of sales or the top quartile of attention was to be learned,
then a threshold was selected that divides the examples along the sales
dimension or the attention dimension into those that belong to the concept and those that
do not. If multiple classes are required, e.g. top, middle, and
bottom, then the appropriate number of thresholds has to be used to divide the examples
into the corresponding classes.
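
The discretization described above can be sketched as follows. This is an illustrative Python sketch rather than the procedure actually used in the study. The function names are hypothetical, and the convention that a value equal to the threshold falls in the upper class follows the note to Table 4.1. The thresholds in the usage lines are the Sales 1990 values from Table 4.1.

```python
from bisect import bisect_right

def percentile_threshold(values, fraction):
    """Return a cutoff such that about `fraction` of the values lie below it."""
    ordered = sorted(values)
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]

def assign_class(x, thresholds):
    """Class 1 if x is below the lowest threshold, class 2 for the next range, and so on."""
    return bisect_right(sorted(thresholds), x) + 1

# Two classes: bottom quartile of 1990 relative sales versus the rest.
print(assign_class(0.30, [0.447]))         # 1
print(assign_class(0.447, [0.447]))        # 2
# Three classes (bottom, middle, top) need two thresholds.
print(assign_class(1.50, [0.447, 1.215]))  # 3
```

Using a sorted list of cutoffs keeps the two-class and multi-class cases in one function: k thresholds yield k + 1 classes.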

4.3  Experimental  Methodology 


4.3.1  Learning  Algorithm 

The  inductive  learning  system  used  for  conducting  the  experiments  is  Quinlan’s 
C4.5,  which  is  a revised  version  of  the  ID3  algorithm.  The  issues  raised  in  the  literature
about commonly found problems with real-world data have been addressed and solutions
incorporated in C4.5. The original ID3 was unable to deal with many of these problems.

Some of the important features deal with noisy data, missing data, the bias of the
selection criterion toward multi-valued attributes, probabilistic classification, production rule generation, etc.
The  options  available  with  the  system  allow  some  flexibility  in  dealing  with  different 
kinds  of  data  sets. 

4.3.2 Performance Measures

While  efficiency  and  economy  of  the  learning  system  in  terms  of  time  and  space 
complexity  are  important  factors,  in  this  study  we  are  concerned  about  the  effectiveness, 
i.e.,  the  error  performance  on  seen  and  unseen  cases,  as  well  as  the  complexity  or  the  size 
of  decision  trees  or  rules. 

The  absolute  measure  of  error  performance  of  learning  systems  is  the  true  error 
rate.  The  true  error  rate  is  statistically  defined  as  the  misclassification  rate  of  the  learning 
system  on  an  asymptotically  large  number  of  new  cases  that  converge  in  the  limit  to  the 
actual  population  distribution  (Sholom  and  Kulikowski,  1991). 

\[ \text{error rate} = \frac{\text{number of errors}}{\text{number of cases}} \]

In  practice,  the  true  error  rate  can  never  be  computed  because  sample  sets  of  test 
cases  will  be  finite  and  relatively  small.  A surrogate  measure  of  error  is  the  apparent 
error,  which  is  the  misclassification  rate  of  the  system  on  the  sample  set  of  cases  that
were used to construct or train the system. For an unlimited training set the apparent error
will approximate the true error rate. However, as real training sets are finite and relatively
small, apparent error rates can differ considerably from the true error rate. In particular,
apparent error rates tend to be biased optimistically when the classifier has been overfitted
or overspecialized to the training set.

We used an approach that Sholom and Kulikowski (1991) have described as the train
and test paradigm. For a real problem dealing with only a single unknown population
distribution, this method supposedly requires fewer observations to estimate the true error
rate. In this method, the observations are partitioned into two groups, one for training and
the other for testing, usually in a 2/3 to 1/3 split. Both should be independent
random samples drawn from the same population. The error rate on the test cases is called
the test sample error rate, and is a better estimate of the true error rate than the apparent error
rate.

For  small  and  moderately  sized  samples,  a single  test  sample  error  estimate  may 
be  misleading,  because  the  partitioning  may  be  uncharacteristic  of  the  structure.  Better 
estimates  can  be  obtained  from  multiple  tests  or  random  subsampling.  From  each  random 
sample,  a new  decision  tree  is  constructed  and  tested.  We  use  the  mean  of  the  error  rates 
of  five  train  and  test  trials  as  the  estimate  of  the  error  rate. 
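
The random subsampling estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the learner is a stand-in that always predicts the majority class (C4.5 itself is not reproduced here), the function names are hypothetical, and the sample cases are invented.

```python
import random
from collections import Counter

def error_rate(classifier, cases):
    """Fraction of (attributes, label) cases that the classifier gets wrong."""
    errors = sum(1 for attributes, label in cases if classifier(attributes) != label)
    return errors / len(cases)

def train_majority(cases):
    """Stand-in learner (an assumption, not C4.5): always predict the commonest label."""
    majority = Counter(label for _, label in cases).most_common(1)[0][0]
    return lambda attributes: majority

def random_subsampling(cases, train, trials=5, train_fraction=2 / 3, seed=0):
    """Mean test sample error rate over several random train/test partitions."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        shuffled = cases[:]
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        training, testing = shuffled[:cut], shuffled[cut:]
        # A new classifier is built from each training partition and
        # scored only on the held-out cases of that partition.
        rates.append(error_rate(train(training), testing))
    return sum(rates) / len(rates)

# Invented data: 40 cases, one in four labelled "high".
cases = [((i,), "high" if i % 4 == 0 else "low") for i in range(40)]
print(random_subsampling(cases, train_majority))
```

Any learner that maps a training list to a classifier function could be passed in place of `train_majority`.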



4.4 Preliminary Results and Analysis


4.4.1 Analysis One

We  cast  the  problem  as  a classification  problem  because  we  were  interested  in 
learning  the  characteristics  of  the  top  performers  and  the  bottom  performers  with  respect 
to attention and sales. Since both sales and attention were continuous-valued, we had to
use  a threshold  or  a cutoff  value  to  divide  the  variable  into  two  ranges  corresponding  to 
two  classes.  By  varying  this  threshold  value  we  could  create  different  class  relationships. 

As we did not have any a priori knowledge of the class relationship, we first
selected thresholds ranging from 10% to 90% at 10% intervals in order to determine
whether any natural classes existed. For each threshold value we split the dependent
variable into two ranges. The values below the threshold corresponded to class 1 and the
values above corresponded to class 2. Then we used C4.5 and discriminant analysis (DA)
at each threshold.

We did not find any natural classification scheme: there was no threshold at
which the error dropped sharply enough to indicate a natural class
relationship in this data. Instead, we found that the error rates decline as the threshold
moves towards the two extremes, and at the 10% and 90% levels the error is considerably
lower than in the middle. Figure 4.2 shows the mean error rates over five trials for C4.5 and
discriminant analysis (DA), for the original (uncompensated) data as well as for the
compensated data (described below). The decline in error rates results,
firstly, from the more numerous class in the training set overwhelming the less numerous class,
because the latter does not have a sufficient number of examples to define its class
boundary, and secondly, from the number of test-set examples belonging to the less
numerous class being inadequate to determine the true error rate. This effect becomes
even more pronounced when the class separability in the description space is very poor,
i.e. in situations where the data are noisy or inconclusive.
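
Part of this decline can be seen without any learning at all. As an illustrative sketch (not an analysis from the study), a classifier that always predicts the more numerous class has an error equal to the proportion of the less numerous class, which shrinks as the threshold moves away from 50%. The function name is hypothetical.

```python
def majority_baseline_error(fraction_below):
    """Error of always predicting the larger class when `fraction_below`
    of the cases fall below the threshold (class 1)."""
    return min(fraction_below, 1.0 - fraction_below)

# Thresholds from 10% to 90% at 10% intervals, as in the experiment.
for percent in range(10, 100, 10):
    error = majority_baseline_error(percent / 100)
    print(f"threshold at {percent}%: baseline error {error:.2f}")
```

Under this assumption the baseline error is 0.10 at the 10% and 90% thresholds and 0.50 at the 50% threshold, so a low raw error rate at the extremes is not by itself evidence of a learned concept.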

Fig 4.2. Error rates for different thresholds for uncompensated (UC) and compensated (C)
data, using C4.5 and discriminant analysis. Training Data 1990 and Test Data 1990.
[Plot: error rate vs. class threshold, Sales 90/90; series for C4.5 and DA.]


The effect is greater for C4.5 than for DA. This is a result of the way C4.5
handles leaf nodes that contain examples of both classes: the more numerous class is
assigned to such a node. If there are many such nodes, then it is very likely that they
will all be assigned the more numerous class, and thus most of the examples in the less
numerous class will be misclassified.

We attempted to compensate for this effect by replicating the examples of the less
numerous class so that the number of examples in the two classes would be approximately
the same. The replication was carried out separately for the training and test sets so
that the two did not share common observations, which would have led to more optimistic
error rates. This compensation made the error rates for splits that resulted in large
differences in the class sizes appear more realistic.
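
A minimal sketch of this kind of compensation is given below. It is an illustration under stated assumptions, not the procedure used in the study: the function name and the sample cases are hypothetical, and the sketch simply repeats minority-class cases until the two class counts match. In keeping with the text, it would be applied to the training set and the test set separately, after the split.

```python
from collections import Counter

def replicate_minority(cases):
    """Repeat cases of the smaller class until both classes have the same count."""
    counts = Counter(label for _, label in cases)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (_, n_major), (minor, n_minor) = counts.most_common(2)
    minority = [case for case in cases if case[1] == minor]
    # Whole extra copies of the minority class, plus a partial copy for the rest.
    copies, remainder = divmod(n_major - n_minor, n_minor)
    return cases + minority * copies + minority[:remainder]

# Invented training partition: 7 cases outside the quartile, 3 inside it.
training = [((i,), "rest") for i in range(7)] + [((i,), "top") for i in range(3)]
balanced = replicate_minority(training)
print(Counter(label for _, label in balanced))
```

Because the replicated cases are exact copies, replicating before the split would place identical observations in both partitions, which is the optimistic bias the text warns against.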

As we can see in Figure 4.2, for all threshold values, the error rates produced by
C4.5 are lower than those of discriminant analysis. From these results we may arrive at one
of two conclusions. First, C4.5 produces a much better fitting model than does DA. This
is because C4.5 is able to describe the surfaces that bound class regions better when they
are non-linear, which may be true in this case. Second, C4.5 has overfitted the noise and
the improvement in error is purely random. As the error rates for the sales/attention and
sales/display models are not any better than random chance, we cannot conclude that C4.5
has constructed a better predictive model for sales. However, the story may be different
for the attention/display model.

We  then  turned  to  the  task  of  learning  the  characteristics  of  the  top  25%  and 
bottom  25%  performers.  We  set  the  threshold  to  divide  the  continuum  accordingly.  We 
also  considered  the  50%  threshold  that  separates  the  top  50%  from  the  bottom  50%. 
These  thresholds  are  shown  in  Table  4.1. 



Table 4.1: Thresholds for Classification Schemes.

                        Percentage (p)
Variable (x)         25%       50%       75%
Sales 1990          .447      .731     1.215
Sales 1991          .493      .865     1.454
Attention 1990      .660      .810      .910
Attention 1991      .540      .790      .920

Note: {x < p --> class 1; x >= p --> class 2}


4.4.2 Analysis Two

In keeping with our third research goal, we have attempted to investigate the three
models that were described in the earlier chapter on the task domain. For each model, a
series of experiments was carried out using different attribute combinations. Only the
results of the attention/display model appear to be promising. The results of the other two
models, i.e. sales/attention and sales/display, looked rather dismal. All tests were repeated
for the 25%, 50%, and 75% threshold points.

From the two data sets, 1990 and 1991, we conducted four sets of tests:

"90/90" : train and test samples from 1990 data
"91/91" : train and test samples from 1991 data
"90/91" : train on 1990 and test on 1991
"91/90" : train on 1991 and test on 1990



4.4.2.1 Display/Attention models


Figure  4.3  : Error  Comparison  Within  and  Across  Years  - Full  Models  : Attention/Display 


The Display/Attention model produced the most promising results. Figure 4.3
shows the test sample error rates for the four cases, namely, train/test combinations of
90/90, 91/91, 90/91, and 91/90. For each case, the figure shows the bottom quartile,
midpoint, and top quartile. As shown in Figure 4.3, the test sample error rates range
overall from about 22% to 35%. The level of accuracy deteriorates by not more than 5% when
the test sample is from a different year than the learning sample. Thus, the results
obtained from one year may be generalizable to another year. These error rates are
significantly better than random chance. For the case of a 25% : 75% split of the data into
the two classes, the expected error rate for random predictions would be 37.5%. These results
warrant further investigation of this model.
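
The 37.5% figure follows if random predictions are assumed to be drawn with the same 25% : 75% class proportions as the data, independently of the true class. A short sketch of that arithmetic, with a hypothetical function name:

```python
def random_prediction_errors(p):
    """Expected errors when labels are guessed at random with the class proportions.

    p is the fraction of cases in the class of interest. Returns
    (missed, false_alarm, total), each as a fraction of all cases.
    """
    missed = p * (1 - p)       # in the class, but guessed to be outside it
    false_alarm = (1 - p) * p  # outside the class, but guessed to be in it
    return missed, false_alarm, missed + false_alarm

print(random_prediction_errors(0.25))  # (0.1875, 0.1875, 0.375)
```

The total of 0.375 is the 37.5% quoted above.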

4.4.2.2  Sales/Display  models 

Figure  4.4  : Error  Comparison  Within  and  Across  Years  - Full  Models  : Sales/Display 


As shown in Figure 4.4, the test sample error rate produced by the Sales/Display
model is about 40%. Not only is the performance of this model poorer than random
predictions, but the performance also deteriorates by 7% to 10% when predicting across
years. Thus, the relationships determined using one year's data cannot be generalized to
the other year, unlike in the attention/display case.

Thus, it is difficult to establish a relationship between the display variables and
sales. One possible explanation of the above observation is that sales/display is a much
more difficult concept to learn than attention/display, and that a larger sample size
may be needed. Another explanation is that the data of any given year
are not typical for the concept, so that a more representative population may be required to train
the learning system. Finally, the most plausible explanation is that the relationships are
weak and mostly random, and that other extraneous factors drive sales more than
the display attributes do.

4.4.2.3 Sales/Attention models

The error levels of the Sales/Attention model are even worse, on the order of
50% (Figure 4.5). These results are comparable to what has been obtained using statistical
discriminant analysis. The predictive ability of these models does not appear to be any better
than random chance. Here too, we may offer the same explanations as in the case of
Sales/Display. It is very likely that there is no relationship between the attention attributes
and sales in the catalog domain, as opposed to the effects observed in the domain of
supermarket shelf display or print media advertisements.



Fig.  4.5  : Error  Comparison  Within  and  Across  Years  - Full  Models  : Sales/Attention 


4.5 Detailed Analysis - Attention/Display Model

We  decided  to  investigate  the  attention/display  model  further  because  our 
preliminary  analysis  showed  some  promising  results  for  this  model.  The  results  on  the 
other  two  models,  namely,  sales/attention  and  sales/display,  did  not  warrant  further 
investigation  because  the  models  are  too  weak  to  make  any  contribution  to  our  knowledge
of  the  domain.

4.5.1 Predictive Performance

We selected random samples from the 1990 and 1991 data and used each to train
separate trees using C4.5. The learned trees were then tested once with the holdout
sample from the same year, and again with the complete data set of the other year. In
each case, the misclassification rate was noted. The process was repeated for five different
sets of samples. The mean misclassification rates are used as a measure of the predictive
performance of the tree. This process was repeated for discriminant analysis and the results
were compared.


Table 4.2a: Comparison of the Test Sample Misclassification Rates of C4.5 and DA for
the Bottom Quartile. (Misclassification rate is given as the mean over five trials.)

                           Training Set 90       Training Set 91
                           Test 90   Test 91     Test 91   Test 90
C4.5          Mean          22.58     28.26       21.8      23.08
              Std Dev        1.26      0.75        2.81      1.30
Discrim.      Mean          41.75     37.36       22.76     43.66
Analysis      Std Dev        7.86      3.87        4.81      4.75

Table 4.2a shows the mean test sample misclassification rate over five random trials
for discriminating the bottom quartile against the rest. Similarly, Table 4.2b shows the
misclassification rate for the top quartile. According to these figures, classification trees
constructed by C4.5 appear to be significantly better than linear discriminant functions
generated with DA, as well as any random classification scheme. Across the different
samples, C4.5 misclassification rates are also more closely clustered than those of
discriminant analysis. The misclassification rate increases only slightly (by roughly 1 to 6
percentage points in Tables 4.2a and 4.2b) when C4.5 is trained with one year's data set
and tested with another.


Table 4.2b: Comparison of the Test Sample Misclassification Rates of C4.5 and DA for
the Top Quartile. (Misclassification rate is given as the mean over five trials.)

                           Training Set 90       Training Set 91
                           Test 90   Test 91     Test 91   Test 90
C4.5          Mean          22.58     25.0        25.28     27.6
              Std Dev        1.52      1.58        1.15      3.13
Discrim.      Mean          39.83     37.88       26.94     34.6
Analysis      Std Dev        1.88      4.71        2.77      1.5

Since the two classes are not equally likely, and because it is the smaller class that
we are interested in predicting in each case (bottom quartile or top quartile), we need to
break down the total error into its component parts, namely, Type 1 error (misclassifying
examples belonging to the quartile of interest) and Type 2 error (misclassifying examples
not belonging to the quartile of interest), and analyze those errors. Table 4.2c shows the
Type 1 and Type 2 errors of the best tree generated by C4.5 in the five trials, for 1990 and
1991, and the expected error rates for random predictions.
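
The breakdown into the two error types can be sketched as follows. This is an illustrative sketch with hypothetical names and invented labels. It assumes that each rate is computed within its own group (Type 1 as a share of the cases in the quartile of interest, Type 2 as a share of the remaining cases); the text does not state the denominators explicitly.

```python
def type_errors(actual, predicted, target):
    """Return (type1, type2) for a two-class problem.

    type1: share of target-class cases that were not predicted as target.
    type2: share of non-target cases that were predicted as target.
    """
    in_target = [p for a, p in zip(actual, predicted) if a == target]
    outside = [p for a, p in zip(actual, predicted) if a != target]
    type1 = sum(p != target for p in in_target) / len(in_target)
    type2 = sum(p == target for p in outside) / len(outside)
    return type1, type2

# Invented test sample: 4 top-quartile cases and 12 others.
actual = ["top"] * 4 + ["rest"] * 12
predicted = ["top", "rest", "rest", "rest"] + ["top"] * 3 + ["rest"] * 9
print(type_errors(actual, predicted, "top"))  # (0.75, 0.25)
```

In this invented sample the overall error is 6 of 16 cases, yet three of the four cases of interest are missed, which is the kind of imbalance that the total error rate alone would hide.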

For C4.5, the Type 1 error rate turned out to be higher than 50%, except for the
bottom quartile when trained and tested on 1990 data, and much higher than the expected
Type 1 error for random predictions. The Type 2 error rate for C4.5 has mostly remained
lower than that of random predictions. One possible explanation for this high
Type 1 error rate is that the set of positive examples in the training sample is just not
sufficient to facilitate effective learning. Another explanation is the inability of the algorithm to deal with
large differences in the number of examples belonging to each class in the learning set,
especially under noise. The more the algorithm attempts to reduce overfitting of the noise
by pruning, the more severe this problem becomes. Thus, in a noisy domain there
will be more severe restrictions on the sample sizes and the relative sizes of the classes.
Finally, the lack of sharp delineation of the classes when the dependent variable is
real-valued may also contribute to this problem. In the latter part of this report we discuss the
use of another method that can handle continuous-valued dependent variables in a much
more natural way. Regression trees (Breiman et al., 1984) are an analogous inductive
learning method for handling continuous classes.


Table 4.2c: Breakdown of C4.5 Misclassification Error to Type 1 and Type 2 Errors for
Bottom and Top Quartiles.

                                  C4.5 Misclassification Rates (%)
Train/Test Sample   Quartile      Type 1      Type 2
1990/1990           Bottom        31.25       21.74
                    Top           70.00        9.20
1991/1991           Bottom        54.17       16.51
                    Top           55.18       14.14
Random Chance       Bottom/Top    18.75       18.75


4.5.2 Attribute Significance and Tree Stability

The structure of the trees generated by C4.5 was very fragmentary and
difficult to comprehend. Some of the ordered attributes appear several times along a
single path from the root to a leaf node. The order in which the attributes appear in the
tree hierarchy also differs from tree to tree depending on the learning sample. Sometimes,
even the root attribute changes. Thus, it is difficult to gain insights into the underlying
structure of the domain problem from the tree structure. Under these circumstances the
best we could do is to examine several trees and determine which attributes are likely to
influence the class.

We will consider as consistent or significant those attributes that appear in at least
two of the five classification trees which we constructed using randomly selected learning
sets from 1990 and 1991. Table 4.3 shows the attributes selected by C4.5 in all the trees
using the 14-attribute data for predicting the top quartile of attention. Similarly, Table 4.4
shows the results for predicting the bottom quartile of attention using the same data.
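
The consistency criterion above amounts to counting, for each attribute, the number of trees in which it appears at least once. A minimal sketch follows, with a hypothetical function name. The attribute abbreviations are those used in the tables, but the five attribute lists are invented for illustration and are not the trees from the study.

```python
from collections import Counter

def consistent_attributes(trees, min_trials=2):
    """Map each attribute to the number of trees using it, keeping only
    attributes that appear in at least `min_trials` trees."""
    counts = Counter()
    for attributes in trees:
        # An attribute tested several times in one tree is counted once.
        counts.update(set(attributes))
    return {name: n for name, n in sorted(counts.items()) if n >= min_trials}

# Invented attribute lists for five trees.
trees = [
    ["PRODS", "PICS"],
    ["PRODS", "SW"],
    ["PRODS", "PICS", "PRODS"],
    ["PRODS"],
    ["PRODS", "ACC"],
]
print(consistent_attributes(trees))  # {'PICS': 2, 'PRODS': 5}
```

Counting by tree rather than by node keeps an ordered attribute that is split on repeatedly along one path from inflating its apparent consistency.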

From Table 4.3 we see that six attributes, product size (PRODS), picture size
(PICS), other products outside the picture (OOP), product group (PRODG), accessories (ACC), and
other products in picture (OIP), have a steady influence on high attention. In 1991, three
more attributes, swatches (SW), complementing accessories (CIA), and same products
outside the picture (SOP), have affected attention. From 1990 to 1991, the number of
swatches and of same products outside the picture have, on average, decreased only slightly,
while complementing accessories have increased. But these differences are marginal.



Table 4.3: For predicting top quartile, attributes selected by C4.5 in at least 2 of the 5
trials for 90 and 91 databases. (Numerals within parentheses indicate number of trials.)

Top Quartile - Model A14

       90 & 91          91
  1    PRODS (5,5)      SW (3)
  2    PICS (5,5)       CIA (4)
  3    OOP (2,4)        SOP (2)
  4    PRODG (2,5)      -
  5    ACC (3,3)        -
  6    OIP (3,5)        -

Table 4.4: For predicting bottom quartile, attributes selected by ID3 in at least 2 of the
5 trials for 90 and 91 databases. (Numerals within parentheses indicate number of trials.)

Bottom Quartile - Model A14

       90           90 & 91          91
  1    OOP (2)      PRODS (5,5)      PICS (5)
  2    OP (3)       PRODG (3,3)      ACC (4)
  3    -            SW (5,5)         CEA (2)
  4    -            CIA (5,5)        OIP (4)
  5    -            PRODT (5,5)      SIP (3)
  6    -            -                SOP (5)

The set of attributes that consistently affect low attention does not completely overlap
with the attribute set for high attention. This is because of the heterogeneous nature of
the instance space. While product size and product group influence low-end attention as
well, the other attributes consistent across both years are swatches, complementing
accessories, and product type (PRODT). In addition, two other attributes are consistent
for 1990 only and six for 1991 only. Even though the maximum number of other products in
a display more than doubled in 1991, there is no significant change in the mean. Of the
six additional attributes in 1991, the maximum number of other products in the picture has
more than doubled, and the number has increased on average. The other attributes do not
show much difference. At this level of analysis we can determine only which attributes
affect attention, not how they influence it; for that we have to analyze the rules.

4.5.3 Rule Abstracts

In order to determine underlying relationships among the attributes in the instance
space, we have to search the rule sets for recurring patterns and analyze them. The rule
abstracts presented here describe the patterns in the rule sets generated by C4.5 for
each data set over the five trials. Only significant rules, i.e., rules that have
relatively high usage and low error rates, were considered. Very few rules satisfied
these conditions. Most of the generated rules had either low usage or high error rates,
and are thus regarded as too specific to be generalized, or too irrelevant or inaccurate
to be useful (Appendix C). As defined earlier, Product Group 1 refers to pants and skirts
with one other product in the picture, and Product Group 2 refers to the rest.


4.5.3.1 Top Quartile

a) Learning Set A14 for 1990:

Attention to a product tends to be within the top quartile when the product size is
not less than ten percent of the page and, in addition:

- there are at least three other products outside the picture {trials 1, 3};

- no other products are inside the picture {trials 2, 5}; or

- the picture size is above fifteen percent of the page, there are fewer than
three swatches, and no other products are inside the picture {trial 4}.

b) Learning Set A14 for 1991:

Attention to a product tends to be within the top quartile when the picture size is
not less than twenty percent of the page and, in addition:

- neither other products nor the same product appear in the picture {trial 1}; or

- the product size is not smaller than seven percent of the page and no other products
are in the picture {trials 2, 3, 4, 5}.

In addition, attention to a product tends to be within the top quartile when the
product size is not small and:

- there are no complementary accessories {trial 1}.

4.5.3.2  Bottom  Quartile 

The  criteria  used  earlier  on  the  usage  and  error  rates  had  to  be  somewhat  relaxed 
because  very  few  if  any  rules  qualified  under  the  previous  criteria. 



a) Learning Set A14 for 1990:

Attention to a product tends to be in the bottom quartile if:

- the product size is less than five percent of the page, the product belongs to group
one, and there are not more than two other products {trial 1};

- the product is a skirt in group two with fewer than two complementing accessories and
its size is not less than three percent of the page {trials 1, 2, 4}; or

- the product size is less than five percent of the page, the product belongs to group
one, and there are no more than four other products outside the picture {trials 2, 4}.

b) Learning Set A14 for 1991:

Attention to a product in group one tends to be in the bottom quartile when the
product size is less than seven percent of the page and, in addition:

- there are more than two other products in the picture and no same products outside
the picture {trials 1, 3, 5}.


CHAPTER  5 

EXPERIMENTAL  METHODOLOGY,  RESULTS,  AND  ANALYSIS  - PART  II 

5.1  Methodology 

5.1.1  Measure  of  Predictive  Performance 

Relative mean squared error, or relative error (RE) (Breiman et al., 1984), is used as
the basis for comparing the predictive performance of regression trees and linear
regression. Relative error is defined as the ratio of the mean squared error (MSE) of the
model to the MSE obtained using the mean as a predictor. The mean can be viewed as the
baseline predictor of the response variable when nothing is known about the description
space. Relative error is always non-negative. If RE ≥ 1, the predictive performance of
the model is no better than that of the mean. If RE < 1, the model is a better predictor
of the response variable. The smaller the RE, the better the predictor.
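The definition of relative error can be written out as a short function. This is a sketch under one stated assumption: the baseline mean is taken over the same observations on which the model is evaluated, which the text does not specify. The function name and the example values are illustrative only.

```python
def relative_error(actual, predicted):
    """RE = MSE(model) / MSE(mean predictor).

    Assumption: the baseline mean is computed from `actual`, i.e. from
    the observations being evaluated.
    """
    n = len(actual)
    mean_y = sum(actual) / n
    mse_model = sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n
    mse_mean = sum((y - mean_y) ** 2 for y in actual) / n
    return mse_model / mse_mean

# Made-up attention values, for illustration only.
actual = [0.2, 0.4, 0.6, 0.8]
re_model = relative_error(actual, [0.3, 0.4, 0.6, 0.7])     # model beats the mean
re_baseline = relative_error(actual, [0.5, 0.5, 0.5, 0.5])  # predicting the mean itself
```

Predicting the mean for every observation gives RE = 1 by construction, which is why values at or above 1 indicate a model that adds nothing over the baseline.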

The true MSE of the model can be estimated in several ways. One way is to test the model
on the training set itself. Breiman et al. (1984) describe this as the resubstitution
estimate. It gives an optimistic estimate of the true MSE. A second and better method is
test sample estimation, in which a set of observations entirely different from the set
used for learning is used for testing. We use test sample estimation in our experiments.
The accuracy of the estimate of the true MSE can be further improved by constructing
several trees, each from a randomly selected training set, and taking the mean of the
test sample error estimates.
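Repeated test sample estimation can be sketched as follows. This is an illustration of the averaging idea only: the `fit` interface, the toy group-mean model standing in for a regression tree, and the synthetic data are all assumptions made for the example, not part of the study.

```python
import random

def relative_error(actual, predicted):
    """MSE of the model divided by MSE of the mean predictor (mean of `actual`)."""
    n = len(actual)
    mean_y = sum(actual) / n
    mse_model = sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n
    mse_mean = sum((y - mean_y) ** 2 for y in actual) / n
    return mse_model / mse_mean

def fit_group_means(train):
    """Toy stand-in for a learned model: predict the mean response for each x value."""
    overall = sum(y for _, y in train) / len(train)
    groups = {}
    for x, y in train:
        groups.setdefault(x, []).append(y)
    means = {x: sum(ys) / len(ys) for x, ys in groups.items()}
    return lambda x: means.get(x, overall)

def mean_test_re(rows, fit, trials=5, train_frac=0.65, seed=0):
    """Average the test-sample RE over several random train/test splits."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        predict = fit(train)
        actual = [y for _, y in test]
        predicted = [predict(x) for x, _ in test]
        estimates.append(relative_error(actual, predicted))
    return sum(estimates) / len(estimates), estimates

# Synthetic data: one binary attribute that shifts the response (illustration only).
rows = [(i % 2, (0.8 if i % 2 else 0.3) + 0.01 * (i % 5)) for i in range(40)]
mean_re, per_trial = mean_test_re(rows, fit_group_means)
```

Each trial never tests on observations used for fitting, so each estimate avoids the optimism of the resubstitution estimate, and averaging across trials reduces the dependence on any single split.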



5.1.2  Experimental  Method 

Regression  tree  methodology  is  used  to  construct  a descriptive  model  for  attention 
in  terms  of  the  display  attributes  of  the  products  in  the  sales  catalog.  Attention  is  the 
percentage  of  consumers  that  looked  at  the  product  out  of  those  who  viewed  the  catalog. 
The  predictive  ability  of  the  regression  model  is  then  compared  with  the  results  of  the 
classical  linear  regression  model. 

Two databases were available, from the 1990 and 1991 sales catalogs. A set of fourteen
attributes (Appendix A) was selected based on the expert's prior notion of the importance
of attributes. For each database, learning was performed using a random sample. Once the
tree was fully grown, a different set of examples from the same database was used to
prune it. Finally, the pruned tree was tested on this second set and also on the other
database. Each train and test trial was performed five times, using a different random
sample each time.

As  in  the  case  of  classification  in  the  previous  chapter,  we  used  scheme  three  for 
Placement  (Figure  4.1).  Again,  as  before,  Product  Group  is  a composite  attribute  having 
two  values.  Group  1 consists  of  displays  of  pants  or  skirts  having  one  other  product  in 
the  picture.  The  rest  belong  to  Group  2. 
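The composite Product Group attribute can be expressed as a small function. This is a sketch: the argument names and the product type labels are hypothetical, and "one other product" is read here as exactly one.

```python
def product_group(product_type, other_products_in_picture):
    """Return 1 for pants or skirts shown with one other product in the
    picture, and 2 for every other display.

    Assumption: "one other product" means exactly one.
    """
    if product_type in {"pants", "skirt"} and other_products_in_picture == 1:
        return 1
    return 2
```

For example, `product_group("skirt", 1)` falls in Group 1, while a skirt shown alone or a blouse shown with one other product falls in Group 2.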

A parallel set of experiments was conducted with multiple linear regression using the
same data sets. The same independent variables used in the previous experiment were used
here, and the dependent variable is attention. Regression models constructed from one
random sample were tested on another sample of the same database and on the other
database, as in the case of regression trees. The results of the two methodologies are
then compared.

The  measure  of  comparison  used  is  Relative  Error  (RE)  as  defined  earlier.  With 
this  measure  it  is  possible  to  compare  the  predictive  ability  of  two  methods  with  respect 
to  using  the  sample  mean  as  a baseline  predictor.  Relative  error  will  indicate  the  merit 
of  the  models  constructed  by  each  method  using  the  information  given  in  display 
attributes  as  against  using  no  information  at  all  in  predicting  attention. 

Ideally,  for  learning  to  be  accomplished  the  learning  set  should  be  sufficiently 
large  and  observations  should  have  sufficient  predictive  information.  Because  of  the 
inherent  nature  of  real-world  data  and  the  accompanying  complexity  of  the  underlying 
structure,  a large  training  sample  will  be  required  to  produce  a decision  tree  with 
reasonable  accuracy.  We  would  also  need  a large  test  sample  in  order  to  obtain  an 
accurate  estimate  of  the  error  rate.  Thus,  when  the  available  databases  are  moderately 
sized  or  relatively  small,  which  is  usually  the  case,  different  train  and  test  pairs  can  lead 
to  large  variations  in  the  error  rates  on  unseen  cases.  The  structure  of  the  different  trees 
also  may  have  considerable  variation,  especially  when  the  noise  in  the  data  is 
overwhelming.  In  order  to  make  a robust  estimate  of  the  error  rate,  we  have  to  construct 
several  trees  and  compute  the  mean  value  of  these  or  find  the  best  error  rate.  We  use  the 
mean  error  rate  over  five  trees  using  five  different  train/test  sample  pairs. 

Useful knowledge may be extracted from these trees by analyzing their structure.
Again, the validity of such knowledge may have to be determined based either on the best
tree or on several trees. In general, if the trees display some degree of uniformity, then
validity will be easy to establish. But, if there is wide variation in the tree structures, then
we may have to extract knowledge from each tree based purely on the usage and accuracy
of each piece of knowledge. Even if the tree as a whole is not useful, it may be possible
to extract chunks of knowledge that will be useful.

Once the trees have been constructed, rules corresponding to their leaves are generated
from the trees. These rules are analyzed to determine recurring patterns, if any, so that
a degree of validity can be established. Not all rules have the same level of importance
or significance, because the proportion of examples for which each rule holds differs
from rule to rule. Some rules may cover many examples while others may cover just a
single example. Moreover, the mean value of the response variable for the training
observations of a leaf could be nearly the same as that of the test observations, or the
two could be wide apart. Within a leaf, the variance of the response may be high, yet the
tree produces just a single predicted value for attention. Thus, leaves having large
standard deviations are poor predictors. Also, leaves that cover only a few observations,
or whose mean response differs widely between the training and test observations, cannot
be treated as significant. Thus, it is important to use rule significance as a criterion
in establishing rule validity. In our analysis, we look at significant rules only and
ignore the rest. A heuristic based on the number of observations covered in the training
and test samples and on accuracy is adopted in judging the significance of rules. A rule
is regarded as significant if it covers at least ten percent of the training and test
observations ("Test of Frequent Usage"), and its standard deviation is no more than the
weighted average standard deviation of the rule set ("Test of Tight Clustering"). The
weighted average is computed by weighting the standard deviation of each rule by the
proportion of observations covered by that rule. This avoids any tendency for the mean
standard deviation to be biased by rules that cover few observations and have small
dispersions.
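The two tests can be sketched in code. This is an illustration under stated assumptions: the text does not say which sample's standard deviation and coverage enter the weighted average, so the sketch uses the training-sample values; the record layout, the function name, and the four example rules are hypothetical.

```python
def significant_rules(rules, min_coverage=0.10):
    """Apply the Test of Frequent Usage and the Test of Tight Clustering.

    rules: list of dicts with keys 'name', 'train_cov', 'test_cov' (proportions
    of observations covered) and 'std_dev' (standard deviation of the response
    in the leaf). Assumption: std_dev and the weights come from the training sample.
    """
    total_cov = sum(r["train_cov"] for r in rules)
    weighted_sd = sum(r["std_dev"] * r["train_cov"] for r in rules) / total_cov
    keep = [
        r["name"]
        for r in rules
        if r["train_cov"] >= min_coverage      # frequent usage, training sample
        and r["test_cov"] >= min_coverage      # frequent usage, test sample
        and r["std_dev"] <= weighted_sd        # tight clustering
    ]
    return keep, weighted_sd

# Hypothetical rule set (not taken from the experiments).
example_rules = [
    {"name": "A", "train_cov": 0.60, "test_cov": 0.55, "std_dev": 0.10},
    {"name": "B", "train_cov": 0.25, "test_cov": 0.30, "std_dev": 0.20},
    {"name": "C", "train_cov": 0.10, "test_cov": 0.10, "std_dev": 0.12},
    {"name": "D", "train_cov": 0.05, "test_cov": 0.05, "std_dev": 0.05},
]
kept, weighted_sd = significant_rules(example_rules)
```

In this made-up set, rule B is heavily used but too dispersed, and rule D is tightly clustered but too rarely used. Because D carries only a small weight, its small dispersion barely lowers the weighted average, which is the effect the weighting is meant to achieve.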

5.2  Results  and  Analysis 

5.2.1  Predictive  Performance 

Relative error (RE) indicates the performance of the model relative to using the mean as
the predictor; the lower the relative error, the better the predictive performance of the
model. Tables 5.1 and 5.2 show that in all cases where the training and test sets are
from the same year, both regression trees (RT) and linear regression (LR) did
considerably better than the mean predictor, and regression trees did better still than
linear regression. Thus, regression trees can generate more accurate predictors than
those built using classical statistical regression. The difference in performance arises
because linear regression inherently assumes a linear relationship between the
independent variables and the dependent variable, while regression trees make no such
assumption. In fact, the result of a regression tree is a piecewise constant function,
with one predicted value per leaf, and this appears to fit our problem better than a
linear model. Regression trees perform better than linear regression in this problem,
perhaps, because the data sets contain non-linear relationships and interactions among the

Table 5.1: Relative error measured across five trials using Set A14. Both train & test
sets for 1990. Train:test ratio = 65:35

                Relative Error (RE) on Test Set
  Trial (T)     Regress. Tree (RT)     Linear Regr. (LR)
  1             .465                   .716
  2             .465                   .808
  3             .697                   .904
  4             .651                   .719
  5             .625                   .788
  Mean          .581                   .787
  STD           .109                   .077

Table 5.2: Mean relative error across five trials using Set A14. Train & test from same
and different years. Train:Test Ratio = 65:35

                                 Training Set 90         Training Set 91
                                 Test 90    Test 91      Test 91    Test 90
  Regression Trees    Mean RE    .581       .892         .583       .985
                      Std Dev    .109       .105         .091       .059
  Linear Regression   Mean RE    .787       1.054        .792       .945
                      Std Dev    .077       .089         .302       .08


attributes, and because regression trees are better at capturing them. Linear regression
does not capture non-linearities and interactions unless these terms are explicitly
included in the model. Other assumptions of linear regression, such as multivariate
normality of the independent variables, are also violated in our problem. Another reason
why linear regression may perform poorly is its susceptibility to outliers. In the case
of regression trees, an outlier would be isolated and assigned to a separate leaf, so
that the outcome for the other observations is not affected by it.
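The summary rows of Table 5.1 can be recomputed from the five per-trial values. The short check below uses only the figures printed in the table; the variable names are illustrative. The STD row matches the sample standard deviation (divisor n - 1), not the population form.

```python
from statistics import mean, stdev

# Per-trial relative errors from Table 5.1 (1990 train and test, 65:35 split).
rt = [0.465, 0.465, 0.697, 0.651, 0.625]  # regression trees
lr = [0.716, 0.808, 0.904, 0.719, 0.788]  # linear regression

rt_mean, rt_std = mean(rt), stdev(rt)  # stdev() uses the n - 1 divisor
lr_mean, lr_std = mean(lr), stdev(lr)
```

Rounded to three decimals these reproduce the Mean and STD rows of the table, and the same means appear in the 90/90 column of Table 5.2.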

However, when trained on one year and tested on the other, the performance of both
regression trees and linear regression deteriorates sharply (Table 5.2). When trained on
1991 data and tested on 1990 data, both regression trees and linear regression perform
hardly any better than the mean predictor. Thus, it is difficult to generalize the
results of one year to another based on the performance of complete trees. One of the
reasons for the performance degradation when training on the data of one year and testing
on the other is the inconclusiveness of the data. That is, the relationships between the
independent variables and the dependent variable are relatively weak.

Table  5.3  shows  the  mean  value  and  standard  deviation  of  the  observations 
covered  by  leaves  of  a typical  regression  tree  of  size  ten  (size  refers  to  the  number  of 
leaves).  This  is  an  optimal  tree  for  the  given  train/test  samples,  i.e.  the  tree  has  been 
pruned  for  minimum  error  on  the  test  set.  The  values  are  given  for  the  training  sample 
from  1990,  a test  sample  from  the  same  year,  and  a test  sample  from  1991.  In  many  cases 
the  standard  deviation  for  attention  of  any  given  leaf  node  is  considerable  compared  to 
the  mean  value.  Table  5.4  shows  the  proportions  of  observations  that  are  covered  by  each 


Table 5.3: Mean and Standard Deviation for Attention for the 1990 Train and Test, and
1991 Test Data.

          Mean Prediction in Leaf           Std. Dev. in Leaf
  Rule    Train     Test      Test          Train     Test      Test
          1990      1990      1991          1990      1990      1991
  1       .517      .626      .644          .181      .235      .205
  2       .648      .537      .587          .082      .253      .211
  3       .400      .365      .432          -         .161      .197
  4       .240      -         -             -         -         -
  5       .860      .874      .833          .115      .106      .176
  6       .433      .460      .468          .147      -         .141
  7       .853      .820      .832          .082      .143      .152
  8       .467      .290      .442          .069      .070      .134
  9       .649      .607      .567          .119      .121      .152
  10      .880      -         .578          -         -         .210

Table 5.4: Observations Contained in Leaves as a Percentage of the Complete 1990 Train
and Test Sets, and 1991 Test Set.

           Percentage of Observations Covered by Rules
  Rules    Train 1990     Test 1990     Test 1991
  1        4.46%          10.39%        17.79%
  2        3.18           5.19          2.70
  3        0.64           5.19          0.27
  4        0.64           0.00          0.00
  5        63.69          55.84         30.19
  6        2.55           1.30          7.28
  7        1.91           6.49          2.16
  8        1.91           2.60          1.08
  9        18.47          12.99         8.09
  10       1.27           0.00          3.50


leaf.  Using  the  criteria  that  we  stipulated  earlier,  only  rules  five  and  nine  appear  to  be 
significant.  Though  rule  one  covers  a relatively  large  proportion  of  observations  it  does 
not  pass  the  test  of  tight  clustering.  Rule  one  corresponds  to  a random  error  term. 
Similarly,  there  are  rules  that  pass  the  test  on  clustering  but  fail  the  test  on  usage.  These 
rules  are  too  specific  to  be  useful.  These  could  correspond  to  outliers. 

Rule five corresponds to an inordinately large leaf containing about 64% of the training
sample (Table 5.4). Such large leaves are created because the subtree below this node in
the original unpruned tree produced a higher test sample error and was therefore pruned.
In minimizing error, pruning may cause overgeneralization and thus lose useful structural
detail. In this case, it may be more meaningful to trade off error for more detail by
reducing the level of pruning. The resulting tree would not be an optimal tree. If our
objective is to learn more about the underlying structure and not to build an optimal
predictor, then we could deviate from optimality and explore the rule space.

With repeated trials we could identify rules that are significant. In this particular
instance the two rules that we found to be significant are of the following form (the
significant rules developed in the experiments are listed in Appendix D):

Rule 5:

    If   Product Group = 2
    and  Product Size > 3% of page
    and  Product Type ≠ skirts
    ===> Attention = 0.860

Rule 9:

    If   Product Group ≠ 2
    and  Product Size > 3% of page
    and  Other Products in Picture > 0.000
    and  Swatches ≤ 6.000
    and  Placement ≠ 6
    ===> Attention = .649


Rule 5 is interpreted as follows: if a product is in Group 2 (i.e., it is not a skirt or
pants shown with one other product in the picture), is not a skirt, and is larger than
three percent of the page, then predicted attention is 86%. Similarly, rule 9 is
interpreted as: if pants or a skirt with one other product in the picture occupies more
than three percent of the page, has fewer than seven swatches, and is not placed on the
right side of the right page, then predicted attention is 65%.
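The two rules can be encoded as predicates over a product display. This is a sketch only: the dictionary keys are hypothetical names, and the conditions follow the prose interpretation above (fewer than seven swatches, i.e. at most six; a placement code other than 6, taken to mean the right side of the right page).

```python
def rule_5(d):
    """Group 2 product, larger than 3% of the page, not a skirt."""
    return (d["product_group"] == 2
            and d["product_size_pct"] > 3
            and d["product_type"] != "skirt")

def rule_9(d):
    """Group 1 product, larger than 3% of the page, at most six swatches,
    not in placement 6."""
    return (d["product_group"] != 2
            and d["product_size_pct"] > 3
            and d["other_products_in_picture"] > 0
            and d["swatches"] <= 6
            and d["placement"] != 6)

def predict_attention(d):
    """Return the leaf prediction if one of the two rules fires, else None."""
    if rule_5(d):
        return 0.860
    if rule_9(d):
        return 0.649
    return None

# Hypothetical displays, for illustration only.
blouse = {"product_group": 2, "product_size_pct": 5.0, "product_type": "blouse",
          "other_products_in_picture": 0, "swatches": 2, "placement": 1}
skirt = {"product_group": 1, "product_size_pct": 4.0, "product_type": "skirt",
         "other_products_in_picture": 1, "swatches": 3, "placement": 2}
```

The two predicates are mutually exclusive because they require different Product Group values, so the order of the checks in `predict_attention` does not matter.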

As stated earlier, we can derive more specific rules from the more general rules by
reducing the level of pruning and deviating from optimality. Thus, by reducing the
pruning on the above tree we obtained trees of sizes thirteen and fourteen, in which rule
five is split into two rules (5a1 and 5a2) and three rules (5b1, 5b2, and 5b3),
respectively. Rules 5a2 and 5b3 are the same. The relative errors of these two trees were
.516 and .505, while that of the optimal tree was .465. These rules are given below:


Rule 5a1:

    If   Product Group = 2
    and  Product Size > 3% of page
    and  Product Size < 8% of page
    and  Product Type ≠ skirts
    ===> Attention = 0.827

Rule 5a2 or 5b3:

    If   Product Group = 2
    and  Product Size > 8% of page
    and  Product Type ≠ skirts
    ===> Attention = 0.926

Rule 5b1:

    If   Product Group = 2
    and  Product Size > 3% of page
    and  Product Size < 8% of page
    and  Product Type ≠ skirts
    and  Swatches ≤ 0
    ===> Attention = 0.865

Rule 5b2:

    If   Product Group = 2
    and  Product Size > 3% of page
    and  Product Size < 8% of page
    and  Product Type ≠ skirts
    and  Swatches > 0
    ===> Attention = 0.798


Table 5.5: Mean Prediction and Percentage Observations of Rules Derived from Rule 5
by Reduced Pruning.

              Mean Prediction              Percentage Observations
  Rule        Train 1990    Test 1990      Train 1990    Test 1990
  5           .860          .874           63.69%        55.84%
  5a1         .827          .854           42.04         36.36
  5a2,5b3     .926          .912           21.66         19.48
  5b1         .865          .865           17.83         18.18
  5b2         .798          .798           24.20         18.18

Rules 5a1 and 5a2 show that rule 5 can be further specialized along product size: when
the product size exceeds eight percent of the page, predicted attention improves by about
ten percentage points. Rules 5b1 and 5b2 are further specializations of rule 5a1, and
indicate that when the product size is between three and eight percent of the page,
swatches can adversely affect attention. In this range, having no swatches is better than
having any swatches accompany the product.
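Because each specialized rule partitions the observations of its parent, the training-sample figures in Table 5.5 should be mutually consistent: the children's coverages should add up to the parent's, and their coverage-weighted mean prediction should equal the parent's prediction. The short check below uses only the training-sample values from the table; the helper name is illustrative.

```python
# (mean prediction, % of 1990 training observations) from Table 5.5.
rules = {
    "5":   (0.860, 63.69),
    "5a1": (0.827, 42.04),
    "5a2": (0.926, 21.66),
    "5b1": (0.865, 17.83),
    "5b2": (0.798, 24.20),
}

def pooled(children):
    """Coverage-weighted mean prediction and total coverage of child rules."""
    total = sum(rules[c][1] for c in children)
    mean_pred = sum(rules[c][0] * rules[c][1] for c in children) / total
    return mean_pred, total

mean_5, cov_5 = pooled(["5a1", "5a2"])      # should reproduce rule 5
mean_5a1, cov_5a1 = pooled(["5b1", "5b2"])  # should reproduce rule 5a1
```

Both pooled values agree with the parent rows of the table to within rounding, which supports reading 5a1 and 5a2 as a split of rule 5 and 5b1 and 5b2 as a split of rule 5a1.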



Figure 5.1 shows the predicted value of attention against the actual value for the 1990
test data, for regression trees and linear regression built from a sample of the same
year. This provides a clear impression of the dispersion of the response variable in each
leaf. The observations are very sparse for attention values below .5. Most products
appear to receive, on average, 70 or 80 percent attention. Because there were
approximately forty subjects for each data set, a 2.5% difference in attention is
equivalent to just one subject. We also notice that a few observations with very low
attention have high predicted values, and vice versa. The products that received low
attention were ones that were obscured by other products in the picture, and were mostly
skirts and pants.

The predictions of the linear regression model are widely scattered. This indicates a
large component of random noise. It is also possible that some uncontrolled extraneous
factors influenced the data. For example, the sets of subjects used in the laboratory
experiments were different for the two catalogs. The differences in attention levels are
also fuzzy. Because the number of subjects viewing any given catalog is not very large, a
five percent difference in the attention level may be a matter of just one or two
subjects. Thus, small differences in attention may not be significant.

The predictions at the lower end of attention are severely biased upwards for both linear
regression and regression trees, and more so for linear regression. This upward bias,
mostly in the case of linear regression, arises because the relationships are non-linear
and because the observations with higher attention have a stronger influence on the
estimation of the model. It is difficult to draw useful conclusions in this region.


Figure 5.1: Comparison of the regression tree and linear regression predicted values
plotted against actual values of attention. Training data 1990 and test data 1990.
[Plot: Actual vs Predicted - Attention 90/90. Markers: □ RT, + LR]

Figure 5.2: Comparison of the deviation of the predicted values from actual values of
attention for regression trees and linear regression. Training data 1990 and test data 1990.
[Plot: Actual vs Deviation - Attention 90/90. Markers: □ RT, + LR]

Figure 5.3: Comparison of the regression tree and linear regression predicted values
plotted against actual values of attention. Training data 1991 and test data 1991.
[Plot: Actual vs Predicted - Attention 91/91. Markers: □ RT, + LR]

Figure 5.4: Comparison of the deviation of the predicted values from actual values of
attention for regression trees and linear regression. Training data 1991 and test data 1991.
[Plot: Actual vs Deviation - Attention 91/91. Markers: □ RT, + LR]

Figure 5.5: Comparison of the regression tree and linear regression predicted values
plotted against actual values of attention. Training data 1990 and test data 1991.
[Plot: Actual vs Predicted - Attention 90/91. Markers: □ RT, + LR]

Figure 5.6: Comparison of the deviation of the predicted values from actual values of
attention for regression trees and linear regression. Training data 1990 and test data 1991.
[Plot: Actual vs Deviation - Attention 90/91, Set A14, Trial 1. Markers: □ RT, + LR]


As we see, in the case of regression trees the effect of the non-linearity is smaller
because regression trees model non-linearities better. The observations for regression
trees are aligned into rows corresponding to discrete predicted values. Each row relates
to a leaf node or a production rule. The predicted value is the mean value of the
response for that node. Of the rule set, only a few rules appear to be promising in terms
of their predictability and usage. For example, rule no. 5, the top-most row in Figure 5.1
(predicted value .860), tends to be a good predictor of attention above .7. However, it
lacks discriminatory power in the attention band of .7 to 1.0, i.e., for all observations
satisfying the rule it predicts attention as .860 though the actual value could be
anywhere between .7 and 1.0. This could be because differences in attention within this
band may not be significant, as a ten percent difference corresponds to approximately
four subjects. Also, the random error component may be contributing to this. Similarly,
rule no. 9 in Figure 5.1 (predicted value .649) appears to be a good predictor of the
mid-range.

From Figure 5.2 it is clear that for both rules many of the observations have a deviation
of about ± .15 from their true values. As noted earlier, these rules predict a single
value over a wide range of attention because further discrimination or splitting of nodes
is not warranted by the random error in the data, and because the training samples used
to learn concepts in a complex description space are inadequate. Nevertheless, these two
rules do make an important contribution to the knowledge about the domain. However, there
are also rules that do not contribute any knowledge. They are purely random or too
specific to be of any consequence. Rule no. 1 in Figure 5.1 (predicted value .517), for
instance, covers attention over the entire range. Such rules have hardly any predictive
power. Rule no. 1 appears to have captured a significant component of random noise. Most
of the rules that we have been able to generate are either random or do not carry much
weight because they cover only a few examples.

Figures 5.5 and 5.6 show the predictions and deviations for the test data from 1991 using
the models built from 1990 data. Rules 5 and 9 do not perform very well on this data set.
The deviations far exceed those obtained for the 1990 test. This further illustrates that
a significant random error component exists in the data.

Similarly, Figures 5.3 and 5.4 show the predicted values and their deviations from the
actual values for the 1991 test data set, produced by regression trees and linear
regression constructed using 1991 data. Observations are densely distributed over a wider
range of attention for 1991 than for 1990. Here too, we find that some rules do well in
predicting attention. For example, in Figure 5.3, rule no. 7 (first row from top)
captures high-attention products with a few exceptions and is also heavily used. This
rule suggests that as long as product size is not below eight percent of the page,
attention would be almost 90%. Rule no. 6 (third row from bottom), which is also heavily
used, appears to be a good predictor for attention at the 50% level. According to this
rule, pants and skirts with one other product in the picture receive approximately 50%
attention if their size is within two to eight percent of the page. A degradation in
performance parallel to that seen when training on 1990 and testing on 1991 takes place
when training on 1991 and testing on 1990, which supports our suspicions about random
error.



5.2.2  Effect  of  Sample  Size  on  Predictive  Performance 

In our previous experiments, we found that the trees we constructed were
relatively unstable, i.e., the structure of the tree was different for different train and test
sample sets. One possible reason that we suspected was sample inadequacy. Therefore,
we conducted further experiments varying the sample size. We repeated the experiments
for 1990 and 1991 using train:test sample ratios of 50:50, 65:35, 80:20, and 90:10 in each
year, and measured the relative error.


Table 5.6: Relative Error for Regression Trees using A14 for Different Sample Sizes.


Train:Test    Train/Test = 90/90       Train/Test = 91/91
Ratio         Mean      Std. Dev.      Mean      Std. Dev.

50:50         .627      .096           .604      .05
65:35         .581      .109           .583      .091
80:20         .585      .164           .500      .134
90:10         .407      .130           .482      .219

The results of this experiment (Table 5.6) show a trend of decreasing relative error
as the learning sample sizes increase. This leads us to believe that our suspicions are well
grounded, i.e., because of the nature and complexity of the description space, much larger
learning samples may be needed to effect better learning. Though this is a necessary
condition, it is not a sufficient condition for learning to occur. The data has to be
conclusive, that is, observations must hold the predictive information.
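The experiment described above can be sketched as follows. This is only an illustration under stated assumptions: relative error is taken here to be the test mean squared error divided by the mean squared error of always predicting the test-sample mean, which may differ from the exact definition used in our experiments. The helper names (`relative_error`, `split`), the synthetic attention values, and the constant placeholder predictor are hypothetical and stand in for a fitted regression tree.

```python
import random
from statistics import mean


def relative_error(actual, predicted):
    """Squared error of the model divided by that of predicting mean(actual)."""
    baseline = mean(actual)
    sse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sse_baseline = sum((a - baseline) ** 2 for a in actual)
    return sse_model / sse_baseline


def split(observations, train_fraction, rng):
    """Shuffle a copy of the observations and cut it into train and test parts."""
    shuffled = observations[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical attention values; a constant predictor stands in for a tree.
rng = random.Random(0)
attention = [rng.random() for _ in range(200)]
for train_fraction in (0.50, 0.65, 0.80, 0.90):
    train, test = split(attention, train_fraction, rng)
    prediction = mean(train)  # placeholder model: predict the training mean
    error = relative_error(test, [prediction] * len(test))
    print(f"{train_fraction:.2f}: test n = {len(test)}, relative error = {error:.3f}")
```

Note how the test sample shrinks as the training fraction grows, which is why the error measured at the 90:10 ratio rests on the fewest observations.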




However,  one  caveat  has  to  be  added  to  this  result.  As  we  keep  increasing  the 
training  sample  size,  our  test  sample  size  decreases  because  of  the  limited  number  of 
observations  available.  Thus,  the  error  rate  that  we  observe  may  not  be  a good 
approximation  of  the  true  error  rate. 
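One rough way to gauge this caveat is to attach a standard error to each mean in Table 5.6. The sketch below assumes that each reported standard deviation is a sample standard deviation across five independent trials, an assumption that the table itself does not state. The function name `standard_error` and the dictionary layout are illustrative only.

```python
from math import sqrt

TRIALS = 5  # assumption: each mean in Table 5.6 averages five trials

# (ratio, mean, std. dev.) copied from Table 5.6
table_5_6 = {
    "90/90": [("50:50", 0.627, 0.096), ("65:35", 0.581, 0.109),
              ("80:20", 0.585, 0.164), ("90:10", 0.407, 0.130)],
    "91/91": [("50:50", 0.604, 0.05), ("65:35", 0.583, 0.091),
              ("80:20", 0.500, 0.134), ("90:10", 0.482, 0.219)],
}


def standard_error(std_dev, trials=TRIALS):
    """Standard error of a mean computed from `trials` independent values."""
    return std_dev / sqrt(trials)


for year, rows in table_5_6.items():
    for ratio, mean_error, std_dev in rows:
        se = standard_error(std_dev)
        print(f"{year} {ratio}: {mean_error:.3f} +/- {se:.3f}")
```

Under these assumptions the standard errors at the 80:20 and 90:10 ratios are of the same order as the differences between adjacent rows, which is consistent with treating the downward trend as suggestive rather than conclusive.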

5.2.3 Rule Abstracts

Appendix D lists the rule sets constructed from the 1990 and 1991 data
using the regression tree methodology. As we have discussed earlier, some
of these rules are significant in terms of the number of observations that each covers and
how concentrated the response values of these observations are. There are also several
rules that are either too specific or too erratic. In this section we will attempt to interpret
the more important rules.

The following rule abstracts describe the patterns in the underlying rule sets that
were created by regression trees for each data set over the five trials. Only significant
rules, i.e., rules that approximately satisfy our criteria on frequency of usage and
tightness of clustering, were considered. Other rules that did not satisfy the criteria may
not have sufficient validity, and hence, the knowledge extracted from them may not be
useful. The trial(s) for which the pattern is observed is given within the curly brackets.
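The screening step described above can be sketched as a simple filter. The thresholds `min_cases` and `max_std_dev` below are hypothetical values chosen only for illustration, not the criteria actually applied. The five rules are the 1990 Trial 1 rules from Appendix D, using the first figure of each reported pair.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    label: str
    mean: float     # mean attention of the cases covered by the rule
    std_dev: float  # spread of attention among the covered cases
    cases: int      # number of cases covered (frequency of usage)


def is_significant(rule, min_cases=5, max_std_dev=0.15):
    """Keep rules that are used often enough and cluster tightly enough."""
    return rule.cases >= min_cases and rule.std_dev <= max_std_dev


rules = [
    Rule("PRODG = 2 & PRODS <= 3.750", 0.517, 0.181, 9),
    Rule("... PRODT = 8 & OOP <= 4 & PICS <= 60.5", 0.648, 0.082, 5),
    Rule("... PRODT = 8 & OOP <= 4 & PICS > 60.5", 0.400, 0.000, 1),
    Rule("... PRODT = 8 & OOP > 4", 0.240, 0.000, 1),
    Rule("... PRODT not = 8", 0.860, 0.115, 100),
]

kept = [rule for rule in rules if is_significant(rule)]
for rule in kept:
    print(f"{rule.label}: mean {rule.mean:.3f}, n = {rule.cases}")
```

With these illustrative thresholds, the single-case rules are rejected as too specific and the first rule is rejected as too erratic.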

In the following sections, Product Group 1 refers to pants and skirts accompanied
by just one other product in the picture. Product Group 2 refers to all others.




5.2.3.1 Rule abstracts for learning from 1990 data

- Very  small  product  sizes  occupying  less  than  3%  of  the  page  tend  to  receive,  on 
average,  less  than  60%  attention,  and  attention  tends  to  improve  with  product  size  {trials 
1,  2,  and  4}. 

- When product size is larger than 3% of the page, attention on product group 1
is adversely affected by the number of other products in the picture. Attention is high
when no other products appear in the picture {trial 1}.

- When  product  size  is  no  less  than  3%  and  no  more  than  10%  of  the  page, 
swatches  seem  to  adversely  affect  attention  very  mildly  for  products  other  than  skirts  in 
group  2 {trial  2}. 

- In general, products receive poor attention only when the product size is very
small, but otherwise, with the exception of pants and skirts, they receive high attention
{trials 3 and 5}.

5.2.3.2 Rule abstracts for learning from 1991 data

- As product size increases, attention increases. When product size is greater than
7% of the page size, attention is above 90% {all trials}.

- When product size is less than 7% of the page size, attention on products
other than tops is as low as 50% to 60%, but attention on tops is around 80% {all
trials}.

- When  the  product  size  is  as  low  as  4%-5%  of  page  size,  attention  can  be  further 
improved  for  tops  (as  high  as  90%),  by  increasing  picture  size  above  13%  of  page  and 
by  having  no  more  than  three  other  products  inside  or  outside  the  picture  {trial  1}. 




- When  product  size  is  less  than  7%,  attention  on  all  products  belonging  to  group 
2 other  than  tops,  is  affected  by  picture  size  and  number  of  accessories.  Attention  is 
improved  by  increasing  picture  size  and  maintaining  fewer  accessories  {trial  1}. 

- As  the  number  of  complementary  accessories  increases,  attention  decreases. 
When  product  is  larger  than  7%  of  the  page,  products  receive  above  90%  attention  if  no 
complementary  accessories  (CEA)  are  present.  The  negative  influence  of  complementary 
accessories  is  further  reinforced  by  the  presence  of  other  accessories  {trial  2}. 

- When product size is less than 7%, for all products belonging to group 2 other
than tops, placement affects attention. Products on the right of the right page receive the
least attention. In other areas, products without complementary accessories receive better
attention than those with complementary accessories {trial 4}.

- For  all  products  other  than  tops,  more  than  one  complementary  accessory  tends 
to  lower  attention  when  product  size  is  less  than  7%  of  the  page  {trial  5}. 
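As a reading aid, the two patterns observed in all 1991 trials can be written as a small piecewise function. This is a sketch of the abstracts only, not of any fitted tree: the function name `rough_attention_1991`, the returned ranges, and the treatment of a product size of exactly 7% are assumptions made for illustration.

```python
def rough_attention_1991(product_size_pct, is_top):
    """Approximate (low, high) attention range in percent, per the abstracts.

    product_size_pct: product size as a percentage of the page.
    is_top: True when the product is a top.
    """
    if product_size_pct > 7:
        return (90, 100)   # "attention is above 90%"
    if is_top:
        return (80, 80)    # "attention on tops is around 80%"
    return (50, 60)        # "as low as 50% to 60%" for other products


# Exactly 7% is not covered by the abstracts; it is treated as small here.
for size, top in [(9.0, False), (5.0, True), (5.0, False)]:
    print(size, top, rough_attention_1991(size, top))
```

The trial-specific abstracts (picture size, accessories, placement) would add further branches below the 7% split and are left out of this sketch.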

5.2.3.3 Insights from linear regression

From linear regression, we found several attributes significant at a level of .1. The
estimates of the partial regression coefficients in the linear regression model show that,
as product size increases by one square inch, attention increases by two to three percent,
other things remaining equal. Complimenting accessories also show a very small
effect: an increase of one unit increases attention by one or two percent. Other products
in the picture or outside the picture have a negative influence on attention. An additional
product in the picture distracts five to seven percent attention away from the product.
Similarly, an extra product outside the picture would draw away approximately three
percent attention. An additional accessory would cause attention to drop by two percent.
Thus, a few extra accessories or products outside the picture would not be very damaging
to a product, but a product or two inside the picture would drive away considerable
attention. Product type and group show a significant influence on attention.

For  1991,  picture  size  has  become  a significant  attribute  at  a level  of  .1.  A two  to 
three  percent  increase  in  attention  can  be  realized  by  increasing  the  picture  size  by  one 
square  inch.  Product  size  does  not  appear  to  be  as  significant  as  before.  Product  group 
and  type  continue  to  be  important  determinants  of  attention.  Other  products  in  the  picture 
as  well  as  products  outside  the  picture  are  significant,  but  the  effect  is  less  for  products 
outside  the  picture.  While  an  additional  product  inside  the  picture  can  reduce  attention  by 
six  to  seven  percent,  a product  outside  does  so  by  less  than  two  percent. 

Linear regression does not tell the whole story. For example, the estimated increase in
attention may not hold over the entire range of possible values of an attribute such as
product size. However, this non-linearity is never captured in the estimate unless an
additional term that captures the non-linearity is specified in the model. The estimate also
never captures the effect of a change in another attribute on the relationship between
product size and attention unless an additional term that captures the interaction is
included. Thus, in order to capture the relationships among the attributes, a very
sophisticated and, perhaps, cumbersome model has to be constructed. Such model building
will have to be supported, at the outset, by well-developed domain knowledge.
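To make the point about additional terms concrete, the sketch below extends a linear model with one squared term and one interaction term. Once those terms are present, the effect of product size on attention is no longer a single number but depends on the product size itself and on the number of other products in the picture. The coefficient values, the function names, and the choice of these two attributes are hypothetical and serve only to illustrate the shape of such a model.

```python
def design_row(product_size, other_in_picture):
    """Intercept, two main effects, a squared term, and an interaction term."""
    return [
        1.0,
        product_size,
        other_in_picture,
        product_size ** 2,                # captures non-linearity in size
        product_size * other_in_picture,  # captures the interaction
    ]


def predict(coefficients, row):
    """Predicted attention as the dot product of coefficients and terms."""
    return sum(c * x for c, x in zip(coefficients, row))


def size_slope(coefficients, product_size, other_in_picture):
    """Change in predicted attention per unit of product size at one point."""
    _, b_size, _, b_size_sq, b_interaction = coefficients
    return b_size + 2 * b_size_sq * product_size + b_interaction * other_in_picture


# Hypothetical coefficients, not estimates from our data.
coefficients = [0.30, 0.06, -0.05, -0.002, -0.004]

for size, others in [(5, 0), (5, 2), (15, 0)]:
    row = design_row(size, others)
    print(size, others, round(predict(coefficients, row), 3),
          round(size_slope(coefficients, size, others), 3))
```

Every further attribute adds its own squared term and an interaction with each existing attribute, which is why the model quickly becomes cumbersome without domain knowledge to say which terms matter.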


CHAPTER  6 

CONCLUSIONS  AND  FUTURE  RESEARCH 


In  this  report,  we  have  described  and  illustrated  new  approaches  to  classification 
and  prediction  tasks  in  a very  complex  and  unstructured  problem  concerning  the  design 
of  a sales  catalog,  using  artificial  intelligence  techniques  of  machine  learning.  We  set  out 
to  acquire  domain  knowledge  about  the  underlying  relationship  between  display  attributes 
and  attention,  attention  and  sales,  and  finally,  display  attributes  and  sales.  In  doing  so,  we 
also  intended  to  understand  the  behavior  of  the  appropriate  inductive  learning  techniques 
in  this  complex  task  domain,  and  further,  to  develop  a methodology  by  which  useful 
results  can  be  obtained.  Finally,  we  needed  to  be  convinced  that  these  new  approaches 
performed  as  well  as  or  even  better  than  the  traditional  statistical  techniques,  in  addition 
to  the  other  benefits  and  advantages  they  are  claimed  to  provide. 

The domain problem that we were dealing with was a very novel problem, in that
very little previous work has been done in the area, though some work has been done in
related areas, such as advertisements in print media and shelf displays in supermarkets.
The problem was also very challenging because it was relatively unstructured and
inundated with all the idiosyncrasies that one would expect to find in a real-world
situation. These characteristics made the task of learning all the more difficult, unlike in
well-behaved artificial domains where most learning systems have been put to the
test. At the same time, the difficulty and the challenge of the problem made our
experience interesting and enriching.

Besides mixed-type attributes and continuous-valued dependent measures, the data
also possessed considerable fuzziness and measurement errors. Furthermore, the
distributions of most of the attributes were not even close to being well-behaved, and the
instance or description space was heterogeneous. Finally, we were faced with the
problem of relatively small sample sizes for learning and testing, barely sufficient to
facilitate effective learning and to obtain accurate error measurements, given the nature
and complexity of the instance space.

We  have  adopted  a very  empirical  approach  in  finding,  through  scientific  intuition 
and  experimentation,  a solution  to  the  domain  problem,  using  AI  methodologies  of 
learning.  We  experimented  with  two  machine  learning  techniques,  namely,  classification 
tree  induction  using  ID3  and  regression  tree  induction,  and  then  compared  the  results  of 
these  with  corresponding  statistical  techniques,  namely,  discriminant  analysis  and  linear 
regression.  We  were  able  to  exploit  some  of  the  advantages  of  AI  techniques  over 
statistical  techniques  in  this  domain.  For  example,  the  domain  profusely  violated 
assumptions  on  the  distribution  that  would  invalidate  inferences  drawn  from  statistical 
models.  Discriminant  analysis  and  linear  regression  implicitly  assume  a linear  relationship 
or  a function,  which  is  rarely  true.  On  the  other  hand,  AI  techniques  never  demanded  such 
assumptions. Furthermore, mixed-type attributes in the domain were handled in a much
more natural manner by AI than by statistical methods. Finally, we could acquire much more
insightful knowledge of the domain by using AI techniques because of the symbolic
formalisms used to represent the acquired knowledge.

Of the three domain models that we studied, namely, display-attention,
attention-sales, and display-sales, the first produced the most promising results. The results
of the other two models were rather dismal. Even though related domain theory on shelf display
in supermarkets or print advertisements in newspapers suggests relatively strong
relationships between display features and sales, we were unable to support similar
conclusions in the case of sales catalogs. The reason for this may be that the situation of
catalog sales is different from the others in some respects. For instance, products displayed
on supermarket shelves and in newspaper advertisements compete for space with products
of other vendors. In catalogs, all the products are of the same vendor. In supermarkets,
sometimes, the purchase or the brand switching decision is impulsive, and also the
products have a much more frequent repetitive purchase cycle. The catalog consumer can take
the time to go over the catalog repeatedly, back and forth, and evaluate the costs and benefits
more consciously before making the purchase decision. Also, the catalogs reach a
selective audience who have, perhaps, already made a decision to purchase a particular
type of product. The catalogs that consumers choose to view may have been picked
because of brand loyalties. These reasons may also have contributed to our not finding a strong
relationship between attention and sales.

We found a relatively stronger relationship between the display attributes and
attention. Product size is definitely one of the key factors that contribute to attention. As
product size increases, attention increases progressively. But this relationship is
influenced by other factors. Larger picture size also appears to promote attention.
Other products situated inside the picture tend to divert attention away from the
product itself. Accessories, in general, tend to have an adverse effect on attention. Only
products placed at the right side of the right page appeared to be at a disadvantage
compared to other positions on the right as well as the left page. Certain products draw
more attention than others, perhaps because they are popular and commonly worn. Other
influences on attention are rather inconclusive. We also found that these influences are
not uniform; for instance, the effect of other products in the picture on attention may be
different for different product sizes. The effects that we have captured using regression
trees are more than just main effects; they also capture interactions.

In  the  case  of  both  the  ID3  classification  trees  and  regression  trees,  the  tree 
structure  was  relatively  unstable,  i.e.  the  tree  hierarchy  was  different  for  different  samples. 
Typically,  this  occurs  when  two  attributes  have  almost  equal  importance,  and  the  priors 
in  the  training  sample  tip  the  balance  between  selecting  one  or  the  other  attribute. 
Instability is also caused when the training sample is insufficient to learn the concept,
especially when the data is marred by noise. The decision trees that we constructed
varied, sometimes widely, from sample to sample, making the task of generalizing the
results difficult. Some rules resulted from very few examples, and some others had a
heavy error level. Some were very specific and others more general. Thus, once the rules
were generated we had to perform a further level of analysis to differentiate between the
usable and the unusable rules and to discern underlying patterns. We developed criteria
by which we could evaluate the rules for the purpose of determining their usefulness.
Only a few rules were found to be useful. We analyzed the relationships among the
variables for only those rules that were determined to be useful.

The  other  issue  that  we  had  to  deal  with  in  classification  was  the  division  of  the 
continuous-valued  dependent  measure  into  two  or  more  discrete  classes.  Since  there  were 
no  immediately  recognizable  points  in  the  continuum  at  which  the  splits  could  be  made, 
nor  any  clear  method  by  which  we  could  find  them,  we  had  to  try  several  different  cutoff 
points  to  determine  whether  any  natural  cutoff  points  existed.  On  the  other  hand,  since 
we  were  interested  in  determining  the  characteristics  related  to  the  top  or  bottom  quartile 
of  the  dependent  measure,  the  cutoff  points  were  more  or  less  forced  on  us.  This  could 
produce  awkward  results.  As  this  approach  seemed  fragile,  we  extended  our  investigation 
to  regression  trees,  which  had  the  ability  to  deal  with  a continuous-valued  dependent 
variable. 
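The forced cutoffs mentioned above amount to flagging the top or bottom quartile of the continuous dependent measure. A minimal sketch follows, assuming quartile cut points as computed by Python's `statistics.quantiles` with its default method; the function name `quartile_flags` and the sample values are hypothetical, and the cut points used in our experiments may have been determined differently.

```python
from statistics import quantiles


def quartile_flags(values, which="bottom"):
    """Flag each value as inside (True) or outside (False) the chosen quartile."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points for four groups
    if which == "bottom":
        return [v <= q1 for v in values]
    return [v >= q3 for v in values]


# Hypothetical dependent measure: 100 distinct values.
measure = list(range(1, 101))
bottom = quartile_flags(measure, "bottom")
top = quartile_flags(measure, "top")
print(sum(bottom), sum(top))
```

When many observations share the same value near a cut point, the flagged group can be noticeably larger or smaller than a quarter of the sample, which is one way such a split can produce awkward classes.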

In classification, we found that the distribution of examples across classes has to
be fairly uniform, or the examples have to be present in large numbers, especially when
the data is indeterministic or contains noise. This is simply because the tree building
process, in order to avoid large and leafy trees that result from overfitting the data, uses some
stopping criteria or prunes back the trees. Then, the leaves of the final tree, which will
have examples from both classes, will be assigned the more numerous class. As a result,
all or most examples of the less numerous class, and none or very few of the more
numerous class, tend to be misclassified. This gives an optimistic twist to the
performance of the classifier. In this case, it is more meaningful to analyze the error in
terms of its contribution to each class.




Our analysis of the two types of errors, Type 1 (misclassification error of
examples belonging to the quartile of interest) and Type 2 (misclassification error of
examples not belonging to the quartile of interest), showed a different picture of the error
landscape. The Type 1 error, which is of greater interest to us than the Type 2 error, turned
out to be very high, and not any better than chance. This might confirm the suspicions
we have stated in the previous paragraph. It could also mean that the data is insufficient
to facilitate learning, or that the relationships that we discovered were spurious. Finally,
it is also possible that the splitting of the continuous dependent variable into
discrete classes gave rise to these awkward results.
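The difference between the overall error and the two per-class errors can be illustrated with a small sketch. The example is deliberately extreme and hypothetical: a classifier that always predicts the more numerous class on a sample where one quarter of the examples belong to the quartile of interest. The function name `error_breakdown` and the class codes are assumptions made for illustration.

```python
def error_breakdown(actual, predicted, class_of_interest):
    """Return (overall, Type 1, Type 2) misclassification rates.

    Type 1: error rate among examples that belong to the class of interest.
    Type 2: error rate among examples that do not belong to it.
    """
    pairs = list(zip(actual, predicted))
    inside = [(a, p) for a, p in pairs if a == class_of_interest]
    outside = [(a, p) for a, p in pairs if a != class_of_interest]

    def rate(subset):
        return sum(a != p for a, p in subset) / len(subset) if subset else 0.0

    return rate(pairs), rate(inside), rate(outside)


# 25 examples in the quartile of interest (class 1) and 75 others (class 2).
actual = [1] * 25 + [2] * 75
predicted = [2] * 100  # always predict the more numerous class

overall, type_1, type_2 = error_breakdown(actual, predicted, class_of_interest=1)
print(overall, type_1, type_2)  # 0.25 1.0 0.0
```

An overall error of 25% looks respectable, yet every example in the quartile of interest is misclassified, which is the optimistic twist described in the preceding paragraph.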

From  our  analysis  it  is  evident  that  decision  tree  induction  is  more  likely  to 
produce  better  results  than  discriminant  analysis  in  finding  structural  relationships  when 
non-linearities,  interactions,  and  heterogeneities  are  present.  Similarly,  regression  trees 
perform  better  than  linear  regression  in  comparable  situations. 

Given  the  complex  nature  of  the  instance  space,  a larger  training  sample  would 
very  likely  produce  better  results.  It  could  very  well  be  that  with  the  size  of  our  training 
sample,  the  likelihood  of  learning  a close  approximation  of  the  concept  may  not  be  very 
high.  We  investigated  the  effect  of  sample  size  by  varying  the  size  of  the  learning  sample. 
This  showed  a decrease  in  the  error  with  increase  in  sample  size.  However,  we  cannot 
fully  rely  on  this  result  because  of  the  limited  number  of  observations  available.  The  test 
sample  had  to  be  decreased  when  learning  sample  was  increased.  Thus,  the  error  rates  that 
we  measured  may  not  be  good  approximations  of  the  true  error  rate,  though  we  measured 
the  average  error  rate  over  five  trials. 




Our experience shows that automated knowledge acquisition is not only a viable
alternative to statistical approaches as an analytical tool, but is also more than a good
substitute for traditional knowledge engineering methods in situations where knowledge
is not well-defined or is non-existent. It is not a panacea for all types of analytical problems
or knowledge acquisition needs. It is still young, and there is much more to be desired.
We, in the academic business community, should use these techniques more often in our
experimentation, not only to add to the domain knowledge but also to identify weaknesses
of these methods in terms of real problems, so that the learning methods themselves may
evolve and become better. A great payoff is yet to be achieved, and research on
improving these methods as well as their application should go hand in hand.

This research has contributed to extending our knowledge and understanding of
the domain by bringing forth relationships that conventional statistical methods are not
capable of revealing. The rules describe piece-wise linear relationships and interactions.
It has also increased our understanding of the behavior of the learning methods in real
domains. This knowledge will better equip us to deal with the issues that arise in similar
tasks. We have also been able to identify some key shortcomings or difficulties that need
to be addressed in improving the methodology.

Future work should address some of the critical issues that we were faced with,
because these will recur in many real problems. One of the issues that needs to be
addressed is how to deal with data that have excessive noise levels. Before doing this,
we may have to thoroughly investigate the noise tolerance characteristics of these
methods. Another issue is the effect of class imbalances that we described. This requires
assigning multiple classes to leaves and assigning probabilities or certainty factors to
them. Then the question of interpreting such trees has to be addressed. Yet another issue
is the problem of inconclusive data. Current research has addressed this issue to a limited
extent in artificial domains. This will be important because it is very seldom that a data
set will contain all the information needed to determine important concepts. Discovering
composite terms from low level attributes will be very helpful in the learning process. We
have also seen different samples producing unstable trees. This may have to be resolved
by some intelligent preprocessing step that would help select representative samples.
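The suggestion above of assigning several classes to a leaf, each with a probability, can be sketched very simply. The function name `leaf_distribution` and the example leaf are hypothetical; relative frequencies are used here as the probabilities, which is only one of several possible choices of certainty factor.

```python
from collections import Counter


def leaf_distribution(labels):
    """Class probabilities for the training examples that reach one leaf."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical leaf holding 6 examples of class 2 and 4 examples of class 1.
leaf = [2] * 6 + [1] * 4
distribution = leaf_distribution(leaf)
majority = max(distribution, key=distribution.get)
print(distribution, majority)
```

A majority-class label would report only class 2 for this leaf, while the distribution keeps the information that four in ten of its examples belong to class 1.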

This research is an exploration that spans two different disciplines, both relatively
new and unexplored. Cross-fertilization of ideas would necessarily lead to improving the
supply side and the demand side. On the supply side, understanding the limitations of the
Artificial Intelligence methods may well contribute to developing their capabilities.
On the demand side, business and industry can use Artificial Intelligence methodologies
to improve their knowledge and understanding of specific domains so that they can be
more competitive.




APPENDIX  A 

DISPLAY  ATTRIBUTE  DESCRIPTION 


ATTRIBUTE   ATTRIBUTE   MODELS   ATTRIBUTE
NAME        TYPE        A14      DESCRIPTION

PL          C           X        Placement - scheme 3
PL2         C                    Placement - scheme 4
PICS        R           X        Picture size
PRODS       R           X        Product size
ACC         R           X        Accessories
ACCAREA     R                    Area of accessories
SWTYPE      C                    Swatch type
SW          R           X        # of swatches
SWVIEW      C                    Swatch view
SWAREA      R                    Area of swatches
SWPLACE     C                    Placement of swatches
OIP         R           X        Other products in picture
OOP         R           X        Other products outside pic.
SIP         R           X        Same product in picture
SOP         R           X        Same product outside pic.
OP          R           X        Other products
OPAREA      R                    Area of other products
SOPPL       C                    Placement of sop
SCENIC      C                    Scenic background
FUZZY       C                    Fuzzy background
PRODPL      C                    Product placed in backgr.
CONTRAST    C                    Picture contrast
CIA         R           X        # of complimenting prod.’s
CEA         R           X        # of complementary prod.’s
LET         C                    Text description
LETSUM      C                    Text summary
MOD         C                    Modelled
MFACE       C                    Model face
MOTHER      C                    Other models
MORIENT     C                    Model orientation
MACTION     C                    Model action
PRODT       C           X        Product type
PRODG       C           X        Product group



APPENDIX  B 

EYE  TRACKING  ATTRIBUTE  DESCRIPTION 


A)  Eye  tracking  attributes  for  products: 

ATTRIBUTE  ATTRIBUTE  ATTRIBUTE 
NAME  TYPE  DESCRIPTION 


1   DWI    Average of total dwell time index
           = (Total time spent looking at area / # of fix)
             / Individual’s average fixation length
2   DWSA   Average total dwell time for area
3   WOR    Average rank of first fixation without repetition
4   ATTN   % of people attending area
5   FIX    Average number of fixations
6   DWIN   Average amount of time spent looking at an area during the
           first five fixations after the initial fixation

B)  Eye  tracking  attributes  for  text: 

Same  attributes  as  for  products  prefixed  with  "T". 

C)  Eye  tracking  attributes  for  price: 

Same  attributes  as  for  products  prefixed  with  "P". 




APPENDIX  C 

CLASSIFICATION  TREE  RULESETS  FOR  1990  AND  1991 
1990 Bottom Quartile


Final  rules  from  tree  0: 
Rule  5: 

PRODS  <=  6.25 
PRODG = 1 
OP  <=  2 

->  class  1 [85.7%] 

Rule  15: 

CIA  <=  2 
PRODS  > 3.75 
PRODT  = 8 
PRODG  = 2 
->  class  1 [84.1%] 

Rule  6: 

SW  <=  4 
CIA  <=  3 
OP  <=  2 
PRODG  = 1 
->  class  1 [71.9%] 

Rule  1: 

PRODS  <=  3.75 
->  class  1 [63.8%] 


Final  rules  from  tree  1 : 
Rule  5: 

PRODS  <=  6.25 
OOP  <=  4 
PRODG  = 1 
->  class  1 [85.7%] 


Rule  21: 

PRODT  = 8 
PRODS  > 3.75 
PRODG  = 2 
CIA  <=  2 

->  class  1 [84.1%] 

Rule  6: 

SW  <=  4 
OOP  <=  4 
CIA  <=  3 
PRODG  = 1 
->  class  1 [71.9%] 

Rule  1: 

PRODS  <=  3.75 
->  class  1 [63.8%] 


Final  rules  from  tree  2: 

Rule  14: 

CIA  <=  1 
PRODS  > 3.75 
OP  <=  2 
PRODT  = 8 
->  class  1 [79.4%] 

Rule  15: 

PICS  > 35.938 
PRODS  > 3.75 
PRODS  <=  9.75 
PRODT  = 8 
->  class  1 [75.6%] 




Rule  9: 

SW  <=  5 
PRODT = 6 
OP  <=  2 

->  class  1 [65.1%] 

Rule  1: 

PRODS  <=  3.75 
->  class  1 [63.8%] 

Final  rules  from  tree  3: 

Rule  5: 

PRODS  <=  6.25 
OOP  <=  4 
PRODG = 1 
->  class  1 [85.7%] 

Rule  19: 

PRODT  = 8 
PRODS  > 3.75 
PRODG  = 2 
CIA  <=  2 

->  class  1 [84.1%] 

Rule  6: 

SW  <=  4 
OOP  <=  4 
CIA  <=  3 
PRODG  = 1 
->  class  1 [71.9%] 

Rule  1: 

PRODS  <=  3.75 
->  class  1 [63.8%] 

Final  rules  from  tree  4: 

Rule  14: 

SW  <=  4 
CIA  <=  3 
OP  <=  3 
PRODT  = 6 
->  class  1 [77.1%] 


Rule  22: 

PRODS  <=  9.75 
CIA  > 3 
PRODT  = 8 
->  class  1 [75.8%] 

Rule  21: 

PRODS  > 7 
CIA  <=  3 
OP  <=  3 
PRODT  = 8 
->  class  1 [70.0%] 

Rule  1: 

PRODS  <=  3.75 
->  class  1 [63.8%] 


1990  Top  Quartile 

Final  rules  from  tree  0: 

Rule  24: 

PRODS  > 12 
OOP  > 3 

->  class  2 [91.7%] 

Rule  5: 

PRODT  = 1 
->  class  2 [79.4%] 


Rule  23: 

PRODS  > 12 
ACC  > 4 

->  class  2 [63.0%] 


Final  rules  from  tree  1 : 
Rule  6: 

PRODS  > 12.938 
OIP  <=  0 

->  class  2 [87.4%] 


Rule 4:

PICS > 17.812
PICS <= 25
PRODS > 8.312
-> class 2 [79.4%]

Rule 10:

PRODT = 1
-> class 2 [79.4%]

Rule 12:

PL = 2
PRODT = 4
-> class 2 [70.7%]

Rule 16:

PL = 1
CIA > 1
PRODT = 5
-> class 2 [70.7%]

Rule 24:

ACC > 5
-> class 2 [45.3%]


Final rules from tree 2:

Rule 22:

PRODS > 12
OOP > 3
-> class 2 [91.7%]

Rule 11:

PRODT = 1
-> class 2 [79.4%]

Rule 21:

PRODS > 12
ACC > 4
-> class 2 [63.0%]


Final rules from tree 3:

Rule 4:

PICS  > 17.812 
PRODS  > 9 
SW  <=  2 
OIP  <=  0 

->  class  2 [93.3%] 

Rule  6: 

PICS  > 46.75 
OIP  <=  0 

->  class  2 [85.7%] 

Rule  9: 

PRODT  = 1 
->  class  2 [79.4%] 

Rule  18: 

PRODS  > 7.875 
CIA  > 1 
CEA  > 2 
PRODT  = 5 
SIP  <=  0 

->  class  2 [75.8%] 

Rule  11: 

PL  = 2 
PRODT  = 4 

->  class  2 [70.7%] 

Rule  15: 

PICS  <=  44 
CIA  > 1 
PRODT  = 5 
PRODS  > 5.625 
->  class  2 [70.7%] 

Final  rules  from  tree  4: 

Rule  6: 

PRODS  > 12.938 
OIP  <=  0 

->  class  2 [87.4%] 




Rule  9: 

PRODT = 1 
->  class  2 [79.4%] 

Rule 11:

PL  = 2 
PRODT  = 4 

->  class  2 [70.7%] 

Rule  27: 

ACC  > 5 

->  class  2 [45.3%] 


1991  Bottom  Quartile 


Final  rules  from  tree  0: 
Rule  7: 

PRODS  <=  8.75 
OIP  > 2 
SOP  <=  0 
PRODG = 1 
->  class  1 [82.2%] 


Rule  30: 

PRODS  <=  9.625 
SW  > 6 
CIA  <=  5 

->  class  1 [82.0%] 

Rule  2: 

PICS  <=  12 
->  class  1 [54.1%] 

Final  rules  from  tree  1 : 

Rule  5: 

PICS  <=  14.5 
PRODS  > 2.25 
PRODS  <=  4.125 
->  class  1 [84.1%] 


Rule  16: 

PICS  > 39 
PRODS  <=  9.5 
ACC  <=  1 
CIA  <=  5 
PRODT  = 6 
->  class  1 [79.4%] 

Rule  21: 

ACC  <=  3 
PRODT  = 8 
CEA  > 0 

->  class  1 [70.0%] 

Rule  19: 

PRODS  <=  9.5 
OIP  > 2 
PRODT  = 6 
->  class  1 [64.5%] 


Final  rules  from  tree  2: 
Rule  7: 

PRODS  <=  8.75 
OIP  > 2 
SOP  <=  0 
PRODG  = 1 
->  class  1 [82.2%] 

Rule  29: 

PRODS  <=  9.625 
SW  > 6 
CIA  <=  5 

->  class  1 [82.0%] 

Rule 11:

PICS  <=  14.844 
PRODS  > 2 
PRODS  <=  4.125 
->  class  1 [74.0%] 




Rule  9: 

PRODS  <=  2 
OOP  <=  3 
SIP  <=  0 

->  class  1 [70.7%] 
Final  rules  from  tree  3: 
Rule  30: 

PRODS  <=  11.375 
SW  > 6 
CIA  <=  5 
CEA  > 0 

->  class  1 [82.0%] 

Rule  17: 

PRODS  <=  8.75 
SW  > 0 
PICS  > 14.5 
PRODT  = 6 
SW  <=  6 
SIP  <=  0 

->  class  1 [68.4%] 

Rule  4: 

PICS  <=  14.5 
PRODS  <=  4.125 
->  class  1 [65.5%] 


Rule  12: 

PRODT  = 8 
PRODS  <=  11.375 
->  class  1 [61.3%] 


Final  rules  from  tree  4: 
Rule  7: 

OIP  > 2 

PRODS  <=  9.375 
SOP  <=  0 
PRODG = 1 

->  class  1 [82.2%] 


Rule  9: 

PICS  <=  14.5 
PRODS  <=  4.125 
PRODG  = 2 
SIP  <=  0 

->  class  1 [70.0%] 

1991  Top  Quartile 

Final  rules  from  tree  0: 

Rule  31: 

PRODS  > 13.5 
CEA  <=  0 
->  class  2 [82.2%] 

Rule  2: 

PICS  > 22.688 
OIP  <=  0 
SIP  <=  0 

->  class  2 [81.5%] 

Rule  11: 

PICS  > 48.125 
PICS  <=  52.031 
PRODG  = 2 
->  class  2 [70.0%] 

Rule  19: 

PICS  > 54.844 
PICS  <=  66 
PRODS  > 6.25 
PRODS  <=  12 
SW  <=  4 
OIP  > 0
SOP  <=  0 
CIA  > 0 
CEA  <=  2 
->  class  2 [68.4%] 

Rule  10: 

PICS  <=  48.125 
CIA  > 3 

->  class  2 [54.6%] 




Final  rules  from  tree  1:

Rule  10:

PICS  > 22.688
PRODS  > 8.5
OIP  <=  0

->  class  2 [81.5%]

Rule  4:

PICS  > 54.844
PICS  <=  67.375
PRODS  > 6.25
SW  <=  0
OIP  <=  2
OOP  <=  2
->  class  2 [79.4%]


Rule  17: 


PRODS  > 13.5 
PRODT  = 4 
->  class  2 [79.4%] 

Final  rules  from  tree  2: 

Rule  33: 

PRODS  > 13.5 
PRODS  <=  15 
PRODG = 2 
->  class  2 [82.0%] 

Rule  4: 

PICS  > 22.688 
PRODS  > 8.5 
OIP  <=  0 

->  class  2 [81.5%] 
Final  rules  from  tree  3: 
Rule  3: 

PICS  > 22.688 
PICS  <=  56.375 
PRODS  > 8.5 
OIP  <=  0 

->  class  2 [91.7%] 
Final  rules  from  tree  4: 

Rule  4: 

PICS  > 22.688 
PRODS  > 8.5 
OIP  <=  0 

->  class  2 [81.5%] 
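
The rule listings above can be read as ordered condition lists. The following is a minimal Python sketch, not the code used in this study, showing how three of the 1991 Top Quartile rules from tree 0 (Rules 31, 2, and 10) could be encoded and applied to a record. The names Rule, classify, and TREE0_RULES, the first-match evaluation order, the default of None when no rule fires, and the sample records are all assumptions made for illustration. The bracketed percentage is stored simply as the figure printed with each rule.

```python
import operator
from dataclasses import dataclass

# Comparison operators that appear in the rule listings.
OPS = {"<=": operator.le, ">": operator.gt, "=": operator.eq}


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: tuple       # tuple of (attribute, operator, value)
    predicted_class: int
    listed_pct: float       # bracketed percentage printed with the rule

    def matches(self, record: dict) -> bool:
        # A rule fires only when every one of its conditions holds.
        return all(OPS[op](record[attr], value)
                   for attr, op, value in self.conditions)


# Three rules copied from "1991 Top Quartile, Final rules from tree 0".
TREE0_RULES = [
    Rule("Rule 31", (("PRODS", ">", 13.5), ("CEA", "<=", 0)), 2, 82.2),
    Rule("Rule 2", (("PICS", ">", 22.688), ("OIP", "<=", 0),
                    ("SIP", "<=", 0)), 2, 81.5),
    Rule("Rule 10", (("PICS", "<=", 48.125), ("CIA", ">", 3)), 2, 54.6),
]


def classify(record: dict, rules, default_class=None):
    """Return (class, rule name) for the first rule that fires.

    First-match order and the None default are assumptions of this
    sketch; the listings do not state a default class.
    """
    for rule in rules:
        if rule.matches(record):
            return rule.predicted_class, rule.name
    return default_class, None


# Hypothetical records, invented for illustration only.
record_a = {"PRODS": 10.0, "CEA": 0, "PICS": 30.0,
            "OIP": 0, "SIP": 0, "CIA": 0}
record_b = {"PRODS": 5.0, "CEA": 1, "PICS": 10.0,
            "OIP": 1, "SIP": 0, "CIA": 0}
print(classify(record_a, TREE0_RULES))
print(classify(record_b, TREE0_RULES))
```

Keeping each rule as data rather than as hard-coded branches makes it straightforward to load the rulesets of the other trees without changing the matching logic.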


APPENDIX  D 

REGRESSION  TREE  RULESETS  FOR  1990  AND  1991 

1990 

Trial  1 


If  PRODG  = 2 & PRODS  <=  3.750  Mean  is  0.517  (0.626)  Std.dev  is  0.181 
(0.235)  No.  of  cases  is  9 (8) 

If  PRODG  = 2 & PRODS  > 3.750  & PRODT  = 8 & OOP  <=  4.000  & PICS  <= 
60.500  Mean  is  0.648  (0.537)  Std.dev  is  0.082  (0.253)  No.  of 
cases  is  5 (4) 

If  PRODG  = 2 & PRODS  > 3.750  & PRODT  = 8 & OOP  <=  4.000  & PICS  > 
60.500  Mean  is  0.400  (0.365)  Std.dev  is  0.000  (0.161)  No.  of 
cases  is  1 (4) 

If  PRODG  = 2 & PRODS  > 3.750  & PRODT  = 8 & OOP  > 4.000  Mean  is  0.240 
(0.000)  Std.dev  is  0.000  (0.000)  No.  of  cases  is  1 (0) 

If  PRODG  = 2 & PRODS  > 3.750  & PRODT  not  = 8 Mean  is  0.860  (0.874) 
Std.dev  is  0.115  (0.106)  No.  of  cases  is  100  (43) 

If  PRODG  not  = 2 & PRODS  <=  4.000  Mean  is  0.433  (0.460)  Std.dev  is 
0.147  (0.000)  No.  of  cases  is  4 (1) 

If  PRODG  not  = 2 & PRODS  > 4.000  & OIP  <=  0.000  Mean  is  0.853  (0.820) 
Std.dev  is  0.082  (0.143)  No.  of  cases  is  3 (5) 

If  PRODG  not  = 2 & PRODS  > 4.000  & OIP  > 0.000  & SW  <=  6.000  & PL 
= 6 Mean  is  0.467  (0.290)  Std.dev  is  0.069  (0.070)  No.  of  cases 
is  3 (2) 

If  PRODG  not  = 2 & PRODS  > 4.000  & OIP  > 0.000  & SW  <=  6.000  & PL 
not  = 6 Mean  is  0.649  (0.607)  Std.dev  is  0.119  (0.121)  No.  of
cases  is  29  (10) 

If  PRODG  not  = 2 & PRODS  > 4.000  & OIP  > 0.000  & SW  > 6.000  Mean 
is  0.880  (0.000)  Std.dev  is  0.000  (0.000)  No.  of  cases  is  2 (0) 

Trial  2 


If  PRODG  = 2 & PRODT  = 8 & OOP  <=  3.000  & PL  = 1 Mean  is  0.180  (0.000)
Std.dev  is  0.000  (0.000)  No.  of  cases  is  1 (0) 

If  PRODG  = 2 & PRODT  = 8 & OOP  <=  3.000  & PL  not  = 1 & PRODS  <=  11.500
& PICS  <=  60.500  Mean  is  0.674  (0.770)  Std.dev  is  0.047  (0.000)
No.  of  cases  is  7 (1)




If  PRODG  = 2 & PRODT  = 8 & OOP  <=  3.000  & PL  not  = 1 & PRODS  <=  11.500
& PICS  > 60.500  Mean  is  0.530  (0.230)  Std.dev  is  0.000  (0.000)
No.  of  cases  is  1 (1)

If  PRODG  = 2 & PRODT  = 8 & OOP  <=  3.000  & PL  not  = 1 & PRODS  > 11.500
Mean  is  0.490  (0.520)  Std.dev  is  0.000  (0.000)  No.  of  cases  is
1 (1)

If  PRODG  = 2 & PRODT  = 8 & OOP  > 3.000  Mean  is  0.360  (0.240)  Std.dev 
is  0.201  (0.000)  No.  of  cases  is  4 (1) 

If  PRODG  = 2 & PRODT  not  = 8 & PRODS  <=  3.750  Mean  is  0.579  (0.560) 
Std.dev  is  0.222  (0.206)  No.  of  cases  is  10  (4) 

If  PRODG  = 2 & PRODT  not  = 8 & PRODS  > 3.750  & PRODS  <=  11.375  &
SW  <=  0.000  Mean  is  0.892  (0.836)  Std.dev  is  0.067  (0.117)  No.
of  cases  is  30  (13) 

If  PRODG  = 2 & PRODT  not  = 8 & PRODS  > 3.750  & PRODS  <=  11.375  &
SW  > 0.000  Mean  is  0.808  (0.796)  Std.dev  is  0.110  (0.141)  No.
of  cases  is  34  (19) 

If  PRODG  = 2 & PRODT  not  = 8 & PRODS  > 3.750  & PRODS  > 11.375  Mean
is  0.936  (0.907)  Std.dev  is  0.080  (0.085)  No.  of  cases  is  26 
(21) 

If  PRODG  not  = 2 & PRODS  <=  11.000  & PL  = 6 Mean  is  0.405  (0.353)
Std.dev  is  0.124  (0.098)  No.  of  cases  is  4 (3)

If  PRODG  not  = 2 & PRODS  <=  11.000  & PL  not  = 6 & CEA  <=  2.000  &
OOP  <=  1.000  Mean  is  0.770  (0.730)  Std.dev  is  0.071  (0.000)  No.
of  cases  is  3 (1) 

If  PRODG  not  = 2 & PRODS  <=  11.000  & PL  not  = 6 & CEA  <=  2.000  &
OOP  > 1.000  & ACC  <=  0.000  & SW  <=  0.000  Mean  is  0.800  (0.818)
Std.dev  is  0.000  (0.043)  No.  of  cases  is  2 (4)

If  PRODG  not  = 2 & PRODS  <=  11.000  & PL  not  = 6 & CEA  <=  2.000  &
OOP  > 1.000  & ACC  <=  0.000  & SW  > 0.000  Mean  is  0.655  (0.000)
Std.dev  is  0.060  (0.000)  No.  of  cases  is  6 (0)

If  PRODG  not  = 2 & PRODS  <=  11.000  & PL  not  = 6 & CEA  <=  2.000  &
OOP  > 1.000  & ACC  > 0.000  Mean  is  0.538  (0.000)  Std.dev  is  0.100
(0.000)  No.  of  cases  is  4 (0) 

If  PRODG  not  = 2 & PRODS  <=  11.000  & PL  not  = 6 & CEA  > 2.000  Mean
is  0.522  (0.668)  Std.dev  is  0.116  (0.122)  No.  of  cases  is  10 
(4) 

If  PRODG  not  = 2 & PRODS  > 11.000  Mean  is  0.723  (0.707)  Std.dev  is
0.141  (0.157)  No.  of  cases  is  12  (6) 

Trial  3 


If  PRODT  = 8 Mean  is  0.572  (0.560)  Std.dev  is  0.227  (0.185)  No.  of 
cases  is  29  (5) 




If  PRODT  not  = 8 & PRODT  = 6 Mean  is  0.620  (0.667)  Std.dev  is  0.167 
(0.107)  No.  of  cases  is  31  (12) 

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  <=  3.750  Mean  is  0.568  (0.588) 
Std.dev  is  0.212  (0.230)  No.  of  cases  is  10  (4) 

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  > 3.750  Mean  is  0.878  (0.841) 
Std.dev  is  0.101  (0.128)  No.  of  cases  is  92  (51) 


Trial  4 

If  PRODS  <=  4.250  & PL  = 6 Mean  is  0.381  (0.690)  Std.dev  is  0.122 
(0.000)  No.  of  cases  is  9 (1) 

If  PRODS  <=  4.250  & PL  not  = 6 & SW  <=  0.000  Mean  is  0.758  (0.733)
Std.dev  is  0.204  (0.073)  No.  of  cases  is  5 (4)

If  PRODS  <=  4.250  & PL  not  = 6 & SW  > 0.000  & PICS  <=  46.500  & ACC 
<=  0.000  Mean  is  0.747  (0.760)  Std.dev  is  0.082  (0.010)  No.  of 
cases  is  3 (2) 

If  PRODS  <=  4.250  & PL  not  = 6 & SW  > 0.000  & PICS  <=  46.500  & ACC
> 0.000  Mean  is  0.475  (0.000)  Std.dev  is  0.015  (0.000)  No.  of
cases  is  2 (0) 

If  PRODS  <=  4.250  & PL  not  = 6 & SW  > 0.000  & PICS  > 46.500  Mean 
is  0.380  (0.000)  Std.dev  is  0.020  (0.000)  No.  of  cases  is  2 (0) 

If  PRODS  > 4.250  & PRODT  = 6 & OIP  <=  0.000  Mean  is  0.853  (0.560)
Std.dev  is  0.045  (0.000)  No.  of  cases  is  3 (1)

If  PRODS  > 4.250  & PRODT  = 6 & OIP  > 0.000  & PRODS  <=  5.625  Mean 
is  0.370  (0.490)  Std.dev  is  0.010  (0.000)  No.  of  cases  is  2 (1) 

If  PRODS  > 4.250  & PRODT  = 6 & OIP  > 0.000  & PRODS  > 5.625  & ACC
<=  1.000  & PICS  <=  63.250  & SW  <=  4.000  Mean  is  0.597  (0.722)
Std.dev  is  0.077  (0.061)  No.  of  cases  is  9 (5)

If  PRODS  > 4.250  & PRODT  = 6 & OIP  > 0.000  & PRODS  > 5.625  & ACC
<=  1.000  & PICS  <=  63.250  & SW  > 4.000  Mean  is  0.670  (0.880)
Std.dev  is  0.031  (0.000)  No.  of  cases  is  4 (2)

If  PRODS  > 4.250  & PRODT  = 6 & OIP  > 0.000  & PRODS  > 5.625  & ACC 
<=  1.000  & PICS  > 63.250  Mean  is  0.518  (0.450)  Std.dev  is  0.101 
(0.230)  No.  of  cases  is  5 (2) 

If  PRODS  > 4.250  & PRODT  = 6 & OIP  > 0.000  & PRODS  > 5.625  & ACC
> 1.000  Mean  is  0.765  (0.687)  Std.dev  is  0.055  (0.095)  No.  of
cases  is  2 (3) 

If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  = 8 & SW  <=  0.000  & PRODG 
= 2 Mean  is  0.598  (0.417)  Std.dev  is  0.210  (0.133)  No.  of  cases 
is  4 (3) 

If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  = 8 & SW  <=  0.000  & PRODG 
not  = 2 Mean  is  0.841  (0.490)  Std.dev  is  0.088  (0.000)  No.  of 
cases  is  7 (1) 




If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  = 8 & SW  > 0.000  Mean  is 
0.469  (0.602)  Std.dev  is  0.208  (0.078)  No.  of  cases  is  9 (4) 

If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  not  = 8 & PRODS  <=  7.875 
& SW  <=  0.000  & PICS  <=  31.500  Mean  is  0.904  (0.892)  Std.dev 
is  0.031  (0.050)  No.  of  cases  is  9 (5) 

If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  not  = 8 & PRODS  <=  7.875 
& SW  <=  0.000  & PICS  > 31.500  Mean  is  0.801  (0.773)  Std.dev  is 
0.048  (0.172)  No.  of  cases  is  7 (3) 

If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  not  = 8 & PRODS  <=  7.875
& SW  > 0.000  Mean  is  0.771  (0.811)  Std.dev  is  0.115  (0.068)  No.
of  cases  is  22  (11) 

If  PRODS  > 4.250  & PRODT  not  = 6 & PRODT  not  = 8 & PRODS  > 7.875 
Mean  is  0.904  (0.900)  Std.dev  is  0.096  (0.118)  No.  of  cases  is
55  (27) 

Trial  5 


If  PRODT  = 8 Mean  is  0.577  (0.557)  Std.dev  is  0.233  (0.198)  No.  of 
cases  is  22  (12) 

If  PRODT  not  = 8 & PRODT  = 6 & PRODS  <=  6.750  Mean  is  0.475  (0.623) 
Std.dev  is  0.096  (0.130)  No.  of  cases  is  6 (3) 

If  PRODT  not  = 8 & PRODT  = 6 & PRODS  > 6.750  & OOP  <=  3.000  & ACC
<=  0.000  Mean  is  0.611  (0.561)  Std.dev  is  0.073  (0.181)  No.  of
cases  is  14  (8) 

If  PRODT  not  = 8 & PRODT  = 6 & PRODS  > 6.750  & OOP  <=  3.000  & ACC 
> 0.000  Mean  is  0.728  (0.850)  Std.dev  is  0.092  (0.030)  No.  of 
cases  is  6 (2) 

If  PRODT  not  = 8 & PRODT  = 6 & PRODS  > 6.750  & OOP  > 3.000  Mean  is 
0.843  (0.000)  Std.dev  is  0.043  (0.000)  No.  of  cases  is  4 (0) 

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  <=  5.625  & ACC  <=  2.000
& PRODS  <=  1.500  Mean  is  0.360  (0.000)  Std.dev  is  0.000  (0.000)
No.  of  cases  is  1 (0)

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  <=  5.625  & ACC  <=  2.000 
& PRODS  > 1.500  & PICS  <=  8.750  Mean  is  0.490  (0.000)  Std.dev 
is  0.000  (0.000)  No.  of  cases  is  1 (0) 

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  <=  5.625  & ACC  <=  2.000
& PRODS  > 1.500  & PICS  > 8.750  & PL  = 6 Mean  is  0.635  (0.290)
Std.dev  is  0.055  (0.070)  No.  of  cases  is  2 (2)

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  <=  5.625  & ACC  <=  2.000 
& PRODS  > 1.500  & PICS  > 8.750  & PL  not  = 6 Mean  is  0.797  (0.732) 
Std.dev  is  0.073  (0.188)  No.  of  cases  is  25  (5) 

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  <=  5.625  & ACC  > 2.000  Mean 
is  0.330  (0.643)  Std.dev  is  0.000  (0.142)  No.  of  cases  is  1 (3) 




If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  > 5.625  & OOP  <=  2.000  Mean 
is  0.839  (0.874)  Std.dev  is  0.112  (0.138)  No.  of  cases  is  26
(14) 

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  > 5.625  & OOP  > 2.000  &
PRODS  <=  7.875  Mean  is  0.857  (0.780)  Std.dev  is  0.079  (0.152)
No.  of  cases  is  14  (11)

If  PRODT  not  = 8 & PRODT  not  = 6 & PRODS  > 5.625  & OOP  > 2.000  &
PRODS  > 7.875  Mean  is  0.928  (0.939)  Std.dev  is  0.066  (0.061)
No.  of  cases  is  32  (20)


1991 


Trial  1 


If  PRODS  <=  9.625  & PRODT  = 5 & PICS  <=  16.625  Mean  is  0.711  (0.620)
Std.dev  is  0.094  (0.196)  No.  of  cases  is  7 (3) 

If  PRODS  <=  9.625  & PRODT  = 5 & PICS  > 16.625  & PRODS  <=  3.938  Mean 
is  0.782  (0.702)  Std.dev  is  0.137  (0.208)  No.  of  cases  is  12 
(6) 

If  PRODS  <=  9.625  & PRODT  = 5 & PICS  > 16.625  & PRODS  > 3.938  & OIP 
<=  3.000  & OOP  <=  3.000  Mean  is  0.919  (0.883)  Std.dev  is  0.059 
(0.106)  No.  of  cases  is  16  (6) 

If  PRODS  <=  9.625  & PRODT  = 5 & PICS  > 16.625  & PRODS  > 3.938  & OIP
<=  3.000  & OOP  > 3.000  & PICS  <=  33.000  Mean  is  0.806  (0.455)
Std.dev  is  0.060  (0.175)  No.  of  cases  is  5 (2)

If  PRODS  <=  9.625  & PRODT  = 5 & PICS  > 16.625  & PRODS  > 3.938  & OIP
<=  3.000  & OOP  > 3.000  & PICS  > 33.000  Mean  is  0.880  (0.885)
Std.dev  is  0.053  (0.065)  No.  of  cases  is  12  (2)

If  PRODS  <=  9.625  & PRODT  = 5 & PICS  > 16.625  & PRODS  > 3.938  & OIP 
> 3.000  Mean  is  0.795  (0.000)  Std.dev  is  0.043  (0.000)  No.  of 
cases  is  4 (0) 

If  PRODS  <=  9.625  & PRODT  not  = 5 & PRODG  = 1 Mean  is  0.541  (0.493) 
Std.dev  is  0.183  (0.130)  No.  of  cases  is  46  (27) 

If  PRODS  <=  9.625  & PRODT  not  = 5 & PRODG  not  = 1 & PICS  <=  15.094 
Mean  is  0.509  (0.520)  Std.dev  is  0.147  (0.149)  No.  of  cases  is 
16  (9)

If  PRODS  <=  9.625  & PRODT  not  = 5 & PRODG  not  = 1 & PICS  > 15.094
& ACC  <=  4.000  Mean  is  0.738  (0.657)  Std.dev  is  0.183  (0.257)
No.  of  cases  is  61  (21)

If  PRODS  <=  9.625  & PRODT  not  = 5 & PRODG  not  = 1 & PICS  > 15.094 
& ACC  > 4.000  & PRODS  <=  5.063  Mean  is  0.763  (0.540)  Std.dev 
is  0.133  (0.000)  No.  of  cases  is  3 (1) 




If  PRODS  <=  9.625  & PRODT  not  = 5 & PRODG  not  = 1 & PICS  > 15.094 
& ACC  > 4.000  & PRODS  > 5.063  & PRODS  <=  6.125  Mean  is  0.315 
(0.000)  Std.dev  is  0.162  (0.000)  No.  of  cases  is  4 (0) 

If  PRODS  <=  9.625  & PRODT  not  = 5 & PRODG  not  = 1 & PICS  > 15.094
& ACC  > 4.000  & PRODS  > 5.063  & PRODS  > 6.125  Mean  is  0.660  (0.710)
Std.dev  is  0.141  (0.250)  No.  of  cases  is  3 (2) 

If  PRODS  > 9.625  Mean  is  0.894  (0.881)  Std.dev  is  0.125  (0.158)  No. 
of  cases  is  64  (39) 


Trial  2 


If  PRODS  <=  9.500  & PRODT  = 5 Mean  is  0.809  (0.821)  Std.dev  is  0.154 
(0.142)  No.  of  cases  is  57  (18) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODG  = 2 & PRODS  <=  2.500  Mean 
is  0.474  (0.497)  Std.dev  is  0.088  (0.160)  No.  of  cases  is  10 
(9) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODG  = 2 & PRODS  > 2.500  & SW
<=  2.000  & PL  = 6 Mean  is  0.587  (0.519)  Std.dev  is  0.184  (0.195)
No.  of  cases  is  8 (9)

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODG  = 2 & PRODS  > 2.500  & SW 
<=  2.000  & PL  not  = 6 & CEA  <=  0.000  Mean  is  0.806  (0.837)  Std.dev 
is  0.094  (0.144)  No.  of  cases  is  17  (9) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODG  = 2 & PRODS  > 2.500  & SW 
<=  2.000  & PL  not  = 6 & CEA  > 0.000  Mean  is  0.691  (0.648)  Std.dev 
is  0.205  (0.251)  No.  of  cases  is  30  (19) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODG  = 2 & PRODS  > 2.500  & SW
> 2.000  Mean  is  0.415  (0.560)  Std.dev  is  0.115  (0.050)  No.  of
cases  is  2 (2) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODG  not  = 2 Mean  is  0.534  (0.487) 
Std.dev  is  0.158  (0.178)  No.  of  cases  is  47  (24) 

If  PRODS  > 9.500  & CEA  <=  0.000  Mean  is  0.926  (0.907)  Std.dev  is 
0.056  (0.071)  No.  of  cases  is  44  (33) 

If  PRODS  > 9.500  & CEA  > 0.000  & ACC  <=  0.000  Mean  is  0.890  (0.865) 
Std.dev  is  0.124  (0.124)  No.  of  cases  is  8 (6) 

If  PRODS  > 9.500  & CEA  > 0.000  & ACC  > 0.000  Mean  is  0.732  (0.812)
Std.dev  is  0.228  (0.332)  No.  of  cases  is  14  (5)

Trial  3 


If  PRODS  <=  9.000  & PRODT  = 5 Mean  is  0.793  (0.855)  Std.dev  is  0.169 
(0.084)  No.  of  cases  is  52  (23) 

If  PRODS  <=  9.000  & PRODT  not  = 5 Mean  is  0.585  (0.623)  Std.dev  is 
0.207  (0.215)  No.  of  cases  is  120  (60) 




If  PRODS  > 9.000  Mean  is  0.874  (0.872)  Std.dev  is  0.162  (0.122)  No. 
of  cases  is  78  (38) 
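
Each ruleset in this appendix is a set of mutually exclusive paths, so applying one amounts to finding the single leaf whose conditions hold and reading off its mean. The sketch below is a hypothetical Python illustration for the three-leaf 1991 Trial 3 ruleset, not the software used in the study. It uses only the first (unparenthesized) mean and case count of each leaf. The function names and the case-weighted pooling step are assumptions added for illustration.

```python
def predict_1991_trial3(prods: float, prodt: int) -> float:
    """Return the leaf mean that the 1991 Trial 3 ruleset assigns to a record."""
    if prods > 9.000:
        return 0.874        # If PRODS > 9.000
    if prodt == 5:
        return 0.793        # If PRODS <= 9.000 & PRODT = 5
    return 0.585            # If PRODS <= 9.000 & PRODT not = 5


# (number of cases, mean) for each leaf, first figures in the listing.
TRIAL3_LEAVES = [(52, 0.793), (120, 0.585), (78, 0.874)]


def pooled_mean(leaves) -> float:
    """Case-weighted average of the leaf means (illustrative only)."""
    total_cases = sum(n for n, _ in leaves)
    return sum(n * mean for n, mean in leaves) / total_cases


# Hypothetical attribute values, invented for illustration.
print(predict_1991_trial3(prods=12.0, prodt=3))
print(predict_1991_trial3(prods=4.5, prodt=5))
print(round(pooled_mean(TRIAL3_LEAVES), 3))
```

The larger rulesets follow the same pattern with deeper nesting; each "If ... & ..." line corresponds to one root-to-leaf path.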

Trial  4 


If  PRODS  <=  9.500  & PRODT  = 5 Mean  is  0.812  (0.810)  Std.dev  is  0.158 
(0.132)  No.  of  cases  is  53  (22) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODS  <=  2.500  Mean  is  0.446 
(0.501)  Std.dev  is  0.150  (0.113)  No.  of  cases  is  19  (8) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODS  > 2.500  & PRODG  = 2 & PL 
= 6 Mean  is  0.532  (0.583)  Std.dev  is  0.207  (0.137)  No.  of  cases 
is  12  (6) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODS  > 2.500  & PRODG  = 2 & PL 
not  = 6 & CEA  <=  0.000  Mean  is  0.822  (0.807)  Std.dev  is  0.114
(0.115)  No.  of  cases  is  17  (9)

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODS  > 2.500  & PRODG  = 2 & PL 
not  = 6 & CEA  > 0.000  Mean  is  0.677  (0.629)  Std.dev  is  0.223 
(0.226)  No.  of  cases  is  37  (15) 

If  PRODS  <=  9.500  & PRODT  not  = 5 & PRODS  > 2.500  & PRODG  not  = 2 
Mean  is  0.546  (0.507)  Std.dev  is  0.156  (0.169)  No.  of  cases  is 
40  (23) 

If  PRODS  > 9.500  Mean  is  0.873  (0.907)  Std.dev  is  0.162  (0.088)  No. 
of  cases  is  73  (37) 

Trial  5 


If  PRODS  <=  9.375  & PRODT  = 5 Mean  is  0.806  (0.822)  Std.dev  is  0.159 
(0.134)  No.  of  cases  is  49  (26) 

If  PRODS  <=  9.375  & PRODT  not  = 5 & CEA  <=  1.000  & PRODS  <=  2.500 
Mean  is  0.495  (0.498)  Std.dev  is  0.136  (0.145)  No.  of  cases  is 
11  (9) 

If  PRODS  <=  9.375  & PRODT  not  = 5 & CEA  <=  1.000  & PRODS  > 2.500 
& PL  = 6 Mean  is  0.558  (0.540)  Std.dev  is  0.140  (0.195)  No.  of 
cases  is  12  (4) 

If  PRODS  <=  9.375  & PRODT  not  = 5 & CEA  <=  1.000  & PRODS  > 2.500 
& PL  not  = 6 Mean  is  0.734  (0.765)  Std.dev  is  0.204  (0.140)  No. 
of  cases  is  46  (15) 

If  PRODS  <=  9.375  & PRODT  not  = 5 & CEA  > 1.000  Mean  is  0.513  (0.568) 
Std.dev  is  0.185  (0.198)  No.  of  cases  is  55  (33) 

If  PRODS  > 9.375  Mean  is  0.876  (0.894)  Std.dev  is  0.141  (0.153)  No. 
of  cases  is  78  (33) 


REFERENCES 


Braun,  Helmut,  and  Chandler,  John  S.  (1987).  Predicting  stock  market  behavior  through 
rule  induction:  An  application  of  the  learning-from-example  approach,  Decision
Sciences:  18:  415-429. 

Breiman,  L.;  Friedman,  J.H.;  Olshen,  R.A.;  and  Stone,  C.J.  (1984).  Classification  and 
Regression  Trees.  Belmont,  CA:  Wadsworth. 

Buchanan,  B.  G.,  and  Mitchell,  T.  M.  (1978).  Model-directed  learning  of  production 
rules.  In  D.  A.  Waterman  and  F.  Hayes-Roth  (eds.)  Pattern-directed  Inference 
Systems.  New  York:  Academic  Press,  297-312. 

Buchanan,  B.G.,  and  Shortliffe,  E.H.,  (eds.)  (1984).  Rule-Based  Expert  Systems:  The 
MYCIN  Experiments  of  the  Stanford  Heuristic  Programming  Project.  Reading, 
MA:  Addison-Wesley. 

Burke,  Raymond  R.;  Rangaswamy,  Arvind;  Wind,  Jerry;  and  Eliashberg,  Jehoshua  (1990). 
A knowledge-based  system  for  advertising  design,  Marketing  Science:  9:  212-229.

Carbonell,  Jaime  G.  (1989).  Introduction:  paradigms  for  machine  learning.  Artificial 
Intelligence:  40:  1-9. 

Cohen,  Paul  R.,  and  Feigenbaum,  Edward  A.,  (eds.)  (1982).  The  Handbook  of  Artificial 
Intelligence,  Vol.  3.  Reading,  MA:  Addison-Wesley. 

Cox,  Keith  K.  (1970).  The  effect  of  shelf  space  upon  sales  of  branded  products.  Journal 
of  Marketing  Research,  7:55-58. 

Curhan,  Ronald  C.  (1973).  Shelf  space  allocation  and  profit  maximization  in  mass 
retailing,  Journal  of  Marketing,  37:54-60. 

Currim,  Imran  S.;  Meyer,  Robert  J.;  and  Le,  Nhan  T.  (1988).  Disaggregate  tree-structured 
modelling  of  consumer  choice  data,  Journal  of  Marketing  Research,  25:253-265. 

Davis,  R.,  and  Lenat,  D.  B.  (1982).  Knowledge-Based  Systems  in  Artificial  Intelligence. 
New  York:  McGraw-Hill. 




DeJong,  G.,  and  Mooney,  R.  (1986).  Explanation-based  learning:  An  alternative  view, 
Machine  Learning  1:145-176. 

Diamond,  Daniel  S.  (1968).  A quantitative  approach  to  magazine  advertisement  format 
selection,  Journal  of  Marketing  Research  5:376-386. 

Diday,  E.,  (Ed.)  (1984).  Data  Analysis,  Learning  Symbolic  and  Numeric  Knowledge.  New 
York:  Nova  Science  Publishers,  Inc. 

Duda,  R.O.;  Gaschnig,  J.;  and  Hart,  P.E.  (1979).  Model  design  in  the  PROSPECTOR
consultant  system  for  mineral  exploration.  In  Michie  (1979).

Falkenhainer,  B.  C.,  and  Michalski,  R.S.  (1986).  Integrating  quantitative  and  qualitative 
discovery:  The  ABACUS  system,  Machine  Learning  1:  367-401.

Falkenhainer,  Brian  C.,  and  Rajamoney,  Shankar,  (1990).  The  interdependencies  of  theory 
formation,  revision,  and  experimentation,  Proceedings  of  the  Seventh  International 
Conference  on  Machine  Learning.  Austin:  Morgan  Kaufmann. 

Fisher,  Douglas  H.  (1987).  Knowledge  acquisition  via  incremental  conceptual  clustering. 
Machine  Learning  2:139-172. 

Fisher,  Douglas,  and  Langley,  Pat,  (1990).  Approaches  to  conceptual  clustering.  Working
Paper,  Department  of  Information  and  Computer  Science,  University  of  California, 
Irvine. 

Gennari,  John  H.;  Langley,  Pat;  and  Fisher,  Douglas,  (1989).  Models  of  incremental 
concept  formation,  Artificial  Intelligence  40:11-61. 

Gordon,  A.D.  (1981).  Classification.  London:  Chapman  and  Hall. 

Hand,  D.J.  (1981).  Discrimination  and  Classification.  New  York:  John  Wiley  & Sons. 

Hart,  A.  (1984).  Experience  in  the  use  of  an  inductive  system  in  knowledge  engineering. 
In  Bramer,  M.,  (ed.)  Research  and  Developments  in  Expert  Systems.  Cambridge
University  Press. 

Haussler,  David,  (1988).  Quantifying  inductive  bias:  AI  learning  algorithms  and  Valiant’s 
learning  framework,  Artificial  Intelligence  36:177-221. 

Hayes-Roth,  F.,  Waterman,  D.,  and  Lenat,  D.,  (Eds.)  (1983).  Building  Expert  Systems. 
Reading,  MA:  Addison-Wesley. 




Holland,  J.H.  (1986).  Escaping  brittleness:  The  possibilities  of  general-purpose  learning 
algorithms  applied  to  parallel  rule-based  systems.  In  Michalski,  R.  S.;  Carbonell, 
J.  G.;  and  Mitchell,  T.  M.  (1986).  Machine  Learning:  An  Artificial  Intelligence 
Approach,  Volume  II.  Los  Altos,  CA:  Morgan  Kaufmann. 

Janiszewski,  Chris,  (1989).  The  influence  of  display  variables  on  purchase  behavior, 
Unpublished  manuscript. 

Janiszewski,  Chris,  (1990).  Product  display,  eye  gaze,  and  sales,  Unpublished  manuscript.

Kelly,  Kevin  T.,  (1990).  Theory  discovery  and  the  hypothesis  language,  Proceedings  of 
the  Seventh  International  Conference  on  Machine  Learning.  Austin:  Morgan 
Kaufmann. 

Kidd,  Alison  L.  (1987).  Knowledge  Acquisition  for  Expert  Systems:  A Practical 
Handbook.  New  York:  Plenum  Press. 

Langley,  Pat,  and  Zytkow,  Jan  M.  (1989).  Data-driven  approaches  to  empirical  discovery, 
Artificial  Intelligence  40:283-312. 

Lebowitz,  Michael,  (1987).  Experiments  with  incremental  concept  formation:  UNIMEM, 
Machine  Learning  2:103-137. 

Lenat,  D.  B.  (1977).  On  automated  scientific  theory  formation:  A case  study  using  the 
AM  program,  Machine  Intelligence  9:251-256. 

Lenat,  D.  B.  (1982).  AM:  An  artificial  intelligence  approach  to  discovery  in  mathematics 
and  heuristic  search.  In  Davis,  R.,  and  Lenat,  D.  B.,  (1982).  Knowledge-Based
Systems  in  Artificial  Intelligence.  New  York:  McGraw-Hill. 

Lindsay,  R.  K.;  Buchanan,  B.  G.;  Feigenbaum,  E.  A.;  and  Lederberg,  J.  (1980). 
Applications  of  artificial  intelligence  for  organic  chemistry:  The  DENDRAL 
project.  New  York:  McGraw-Hill. 

Lockett,  A.  G.,  and  Holland,  C.  P.  (1991).  Competitive  advantage  using  information 
technology:  myth  or  reality?  The  International  Review  of  Retail,  Distribution,
and  Consumer  Research,  1:261-282.

Luger,  George  F.,  and  Stubblefield,  William  A.  (1989).  Artificial  Intelligence  and  the 
Design  of  Expert  Systems.  Redwood  City,  CA:  Benjamin/Cummings. 

McCorkle,  Denny  E.  (1990).  The  role  of  perceived  risk  in  mail  order  catalog  shopping. 
Journal  of  Direct  Marketing,  4:26-35. 




Messier,  William  F.,  Jr.  and  Hansen,  J.V.  (1988).  Inducing  rules  for  expert  system 
development:  An  example  using  default  and  bankruptcy  data.  Management 
Science:  34:1403-1415. 

Michalski,  Ryszard  S.  (1983).  A theory  and  methodology  of  inductive  learning,  Artificial 
Intelligence,  20:111-158.

Michalski,  Ryszard  S.,  and  Stepp,  Robert  E.,  (1983a).  Automated  construction  of
classifications:  conceptual  clustering  versus  numerical  taxonomy,  IEEE
Transactions  on  Pattern  Analysis  and  Machine  Intelligence:  5:396-409.

Michalski,  R.  S.;  Carbonell,  J.  G.;  and  Mitchell,  T.  M.  (1983b).  Machine  Learning:  An 
Artificial  Intelligence  Approach.  Palo  Alto,  CA:  Tioga. 

Michalski,  R.  S.;  Carbonell,  J.  G.;  and  Mitchell,  T.  M.  (1986).  Machine  Learning:  An 
Artificial  Intelligence  Approach,  Volume  II.  Los  Altos,  CA:  Morgan 
Kaufmann. 

Michalski,  R.  S.,  and  Kodratoff,  Y.  (1990).  Machine  Learning:  An  Artificial  Intelligence 
Approach,  Volume  III.  Los  Altos,  CA:  Morgan  Kaufmann. 

Michie,  D.,  ed.  (1979).  Expert  Systems  in  the  Micro-Electronic  Age.  Edinburgh:  Edinburgh
University  Press. 

Mingers,  John,  (1986).  Expert  systems:  Experiments  with  rule  induction.  Journal  of 
Operational  Research  37:1031-1037.

Mingers,  John,  (1987a).  Expert  systems:  Rule  induction  with  statistical  data,  Journal  of 
Operational  Research  38:39-47. 

Mingers,  John,  (1987b).  Rule  induction  with  statistical  data:  A comparison  with  multiple 
regression,  Journal  of  Operational  Research  38:347-351. 

Mingers,  John,  (1989a).  An  empirical  comparison  of  selection  measures  for  decision-tree 
induction,  Machine  Learning  3:319-342. 

Mingers,  John,  (1989b).  An  empirical  comparison  of  pruning  methods  for  decision  tree 
induction,  Machine  Learning  4:227-243. 

Minsky,  M.,  and  Papert,  S.  (1969).  Perceptrons.  Cambridge,  MA:  MIT  Press. 

Mitchell,  T.  M.  (1979).  An  analysis  of  generalization  as  a search  problem,  IJCAI  6:577-
582. 




Mitchell,  T.;  Keller,  R.;  and  Kedar-Cabelli,  S.  T.  (1986).  Explanation-based 
generalization:  A unifying  view,  Machine  Learning  1:47-80. 

Natarajan,  Balas  K.  (1991).  Machine  Learning:  A Theoretical  Approach.  San  Mateo,  CA: 
Morgan  Kaufmann. 

Newell,  A.,  and  Simon,  H.  (1963a).  Empirical  explorations  with  the  Logic  Theory 
Machines:  A case  study  in  heuristics.  In  Feigenbaum  E.  A.,  Feldman,  J., 
Computers  and  Thought.  New  York:  McGraw-Hill. 

Newell,  A.,  and  Simon,  H.  (1963b).  GPS:  A program  that  simulates  human  thought.  In 
Feigenbaum  E.  A.,  Feldman,  J.,  Computers  and  Thought.  New  York:  McGraw- 
Hill. 

Newell,  A.,  and  Simon,  H.  (1976).  Computer  science  as  empirical  inquiry:  Symbols  and 
search,  CACM  19:113-126. 

Oren,  Chaim,  (1989).  The  dialect  of  retail  evolution,  Journal  of  Direct  Marketing,  3:15-29.

Peterson,  Robert  A.;  Albaum,  Gerald;  and  Ridgeway,  Nancy  M.  (1989).  Consumers  who 
buy  from  direct  sales  companies.  Journal  of  Retailing,  65:273-286.

Quinlan,  J.R.  (1979).  Discovering  rules  by  induction  from  large  collections  of  examples. 
In  Michie,  D.,  Expert  Systems  in  the  Micro-electronic  Age,  Edinburgh:  Edinburgh
University  Press. 

Quinlan,  J.R.  (1983).  Learning  efficient  classification  procedures  and  their  application  to 
chess  end  games.  In  Michalski,  R.  S.;  Carbonell,  J.  G.;  and  Mitchell,  T.  M. 
(1983b).  Machine  Learning:  An  Artificial  Intelligence  Approach.  Palo  Alto,  CA: 
Tioga. 

Quinlan,  J.  R.  (1986).  Induction  of  decision  trees,  Machine  Learning  1:81-106. 

Quinlan,  J.  R.  (1987).  Simplifying  decision  trees.  International  Journal  of  Man-Machine 
Studies  27:221-234. 

Quinlan,  J.  R.  (1988).  Decision  trees  and  multi-valued  attributes.  Machine  Intelligence 
11:305-318. 

Quinlan,  J.  R.  (1990).  Probabilistic  decision  trees.  In  Kodratoff,  Yves,  Michalski, 
Ryszard,  Machine  Learning:  An  Artificial  Intelligence  Approach  Vol.  3.  San 
Mateo,  CA:  Morgan  Kaufmann.




Quinlan,  J.  Ross,  (1993).  C4.5:  Programs  for  Machine  Learning.  San  Mateo,  CA:  Morgan
Kaufmann. 

Race,  Philip  R.,  and  Thomas,  Richard  C.  (1988).  Rule  induction  in  investment  appraisal, 
Journal  of  Operational  Research  Society,  39:1113-1123. 

Reichgelt,  Han,  (1991).  Knowledge  Representation:  An  AI  Perspective.  Norwood,  New 
Jersey:  Ablex  Publishing  Corporation. 

Rivest,  Ronald  L.  (1987).  Learning  Decision  Lists,  Machine  Learning  2:229-246. 

Rumelhart,  David  E.,  and  McClelland,  James  L.  (1986).  Parallel  Distributed  Processing: 
Explorations  in  the  Microstructure  of  Cognition.  Cambridge,  MA:  MIT  Press. 

Schlimmer,  J.  C.,  and  Fisher,  D.  (1986).  A case  study  of  incremental  concept  induction, 
In  Proceedings  of  the  Fifth  National  Conference  on  Artificial  Intelligence,  Morgan 
Kaufmann:  496-501.

Shaw,  Michael  J.,  and  Gentry,  James  A.  (1988).  Using  an  Expert  System  with  Inductive 
Learning  to  Evaluate  Business  Loans,  Financial  Management  (Autumn):  45-55.

Shim,  Soyeon,  and  Mahoney,  Marianne  Y.  (1991).  Shopping  orientation  segmentation  of 
in-home  electronic  shoppers.  The  International  Review  of  Retail,  Distribution,  and 
Consumer  Research,  1:437-453.

Simon,  H.  A.  (1983).  Why  should  machines  learn?  In  Michalski,  R.  S.;  Carbonell,  J.  G.; 
and  Mitchell,  T.  M.  (1983b).  Machine  Learning:  An  Artificial  Intelligence 
Approach.  Palo  Alto,  CA:  Tioga. 

Stephenson,  Blair  Y.  (1989).  Critical  marketing  strategies  for  1990’s,  Journal  of  Direct 
Marketing,  3:34-41. 

Tam,  Kar  Yan,  (1991).  Applying  rule  induction  to  stock  screening,  in  Proceedings  of  the
Seventh  IEEE  Conference  on  AI  Applications.  Washington:  IEEE  Computer
Society  Press. 

Utgoff,  Paul  E.  (1989).  Incremental  induction  of  decision  trees.  Machine  Learning 
4:161-186. 

Valiant,  L.  G.  (1984).  A theory  of  the  learnable.  Communications  of  the  ACM  27:1134-1142.


BIOGRAPHICAL  SKETCH 


Chrysanthus  de  Almeida  received  his  bachelor’s  degree  in  electrical  engineering
in  1975,  from  the  University  of  Moratuwa,  Sri  Lanka.  He  received  the  master’s  degree 
in  business  administration  in  1988,  and  the  Ph.D.  in  decision  and  information  sciences 
with  specialization  in  management  information  systems,  in  August  1993,  from  the 
University  of  Florida. 

Chrysanthus  worked  extensively  in  radio  and  television  broadcasting  in  Sri
Lanka  before  pursuing  graduate  studies  in  the  U.S.A.  He  held  several  positions  of
responsibility,  including  Director  of  Engineering  (Planning  and  Research)  and
Director  of  Training.

His  research  interests  are  in  the  application  of  artificial  intelligence  to  business 
problem-solving  and  in  the  strategic  and  competitive  use  of  information  systems  in 
organizations. 




I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 



Gary  J.  Koehler,  Chairman

Professor  of  Decision  and  Information  Sciences 

I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Assistant  Professor  of  Marketing 

I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Professor  of  Decision  and  Information  Sciences 

I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 

Antal  Majthay

Associate  Professor  of  Decision  and 
Information  Sciences 

This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  Department  of 
Decision  and  Information  Sciences  in  the  College  of  Business  Administration  and  to  the 
Graduate  School  and  was  accepted  as  partial  fulfillment  of  the  requirements  for  the  degree 
of  Doctor  of  Philosophy. 


Christopher  Janiszewski


August  1993 


Dean,  Graduate  School 


